seanpedrickcase committed
Commit 93b4c8a · 1 Parent(s): 8953ca0

Improved efficiency of review page navigation, especially for large documents. Updated user guide.

Files changed (8):
  1. .dockerignore +3 -1
  2. .gitignore +3 -1
  3. README.md +224 -56
  4. app.py +6 -6
  5. tools/config.py +2 -2
  6. tools/file_conversion.py +605 -285
  7. tools/file_redaction.py +5 -30
  8. tools/redaction_review.py +518 -249
.dockerignore CHANGED
@@ -17,4 +17,6 @@ dist/*
 build_deps/*
 logs/*
 config/*
- user_guide/*
+ user_guide/*
+ cdk/*
+ web/*
.gitignore CHANGED
@@ -18,4 +18,6 @@ build_deps/*
 logs/*
 config/*
 doc_redaction_amplify_app/*
- user_guide/*
+ user_guide/*
+ cdk/*
+ web/*
README.md CHANGED
@@ -20,6 +20,12 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.

 # USER GUIDE

 ## Table of contents

 - [Example data files](#example-data-files)
@@ -35,57 +41,97 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)

 See the [advanced user guide here](#advanced-user-guide):
- - [Modifying and merging redaction review files](#modifying-and-merging-redaction-review-files)
- - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
- - [Merging existing redaction review files](#merging-existing-redaction-review-files)
 - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
 - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
 - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
   - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
   - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
 - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)

 ## Example data files

- Please refer to these example files to follow this guide:
 - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
 - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
 - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)

 ## Basic redaction

- The document redaction app can detect personally-identifiable information (PII) in documents. Documents can be redacted directly, or suggested redactions can be reviewed and modified using a grapical user interface.

 Download the example PDFs above to your computer. Open up the redaction app with the link provided by email.

 ![Upload files](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/file_upload_highlight.PNG)

- Click on the upload files area, and select the three different files (they should all be stored in the same folder if you want them to be redacted at the same time).

- First, select one of the three text extraction options below:
- - 'Local model - selectable text' - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
- - 'Local OCR model - PDFs without selectable text' - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
- - 'AWS Textract service - all PDF types' - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.

 If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
- - 'Local' - This uses the spacy package to rapidly detect PII in extracted text. This method is often sufficient if you are just interested in redacting specific terms defined in a custom list.
- - 'AWS Comprehend' - This method calls an AWS service to provide more accurate identification of PII in extracted text.

- Hit 'Redact document'. After loading in the document, the app should be able to process about 30 pages per minute (depending on redaction methods chose above). When ready, you should see a message saying that processing is complete, with output files appearing in the bottom right.

 ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)

- - '...redacted.pdf' files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
- - '...ocr_results.csv' files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
- - '...review_file.csv' files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.

- Additional outputs are available under the 'Redaction settings' tab. Scroll to the bottom and you should see more files:

- ![Additional processing outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_additional_outputs.PNG)

- - '...review_file.json' is the same file as the review file above, but in .json format.
- - '...decision_process_output.csv' is also similar to the review file above, with a few more details on the location and scores of identified PII in the document.
- - If you are using AWS Textract, you should also get a .json file with the Textract outputs. It could be useful to retain this document to avoid having to repeatedly analyse the same document in future (this .json file can be uploaded into the app on the first redaction tab to load into local memory before redaction).

 We have covered redacting documents with the default redaction options. The '...redacted.pdf' file output may be enough for your purposes. But it is very likely that you will need to customise your redaction options, which we will cover below.

@@ -126,6 +172,16 @@ There may be full pages in a document that you want to redact. The app also prov
 Using the above approaches to allow, deny, and full page redaction lists will give you an output [like this](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/Partnership-Agreement-Toolkit_0_0_redacted.pdf).

 ### Redacting additional types of personal information

 You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?
@@ -146,7 +202,9 @@ Say also we are only interested in redacting page 1 of the loaded documents. On
 ## Handwriting and signature redaction

- The file [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf) is provided as an example document to test AWS Textract + redaction with a document that has signatures in. If you have access to AWS Textract in the app, try removing all entity types from redaction on the Redaction settings and clicking the big X to the right of 'Entities to redact'. Ensure that handwriting and signatures are enabled for redaction on the Redaction Settings tab(enabled by default):

 ![Handwriting and signatures](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/textract_handwriting_signatures.PNG)

@@ -156,72 +214,135 @@ The outputs should show handwriting/signatures redacted (see pages 5 - 7), which
 ## Reviewing and modifying suggested redactions

- Quite often there are certain terms suggested for redaction by the model that don't match quite what you intended. The app allows you to review and modify suggested redactions for the last file redacted. Refresh your browser tab. On the first tab 'PDFs/images' upload the ['Example of files sent to a professor before applying.pdf'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf) file. Let's stick with the 'Local model - selectable text' option, and click 'Redact document'. Once the outputs are created, go to the 'Review redactions' tab.

- On this tab you have a visual interface that allows you to inspect and modify redactions suggested by the app.

 ![Review redactions](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_redactions.PNG)

 You can change the page viewed either by clicking 'Previous page' or 'Next page', or by typing a specific page number in the 'Current page' box and pressing Enter on your keyboard. Each time you switch page, it will save redactions you have made on the page you are moving from, so you will not lose changes you have made.

- On your selected page, each redaction is highlighted with a box next to its suggested entity type. By default the interface allows you to modify existing redaction boxes. Click and hold on an existing box to move it. Click on one of the small boxes at the edges to change the size of the box. To delete a box, click on it to highlight it, then press delete on your keyboard. Alternatively, double click on a box and click 'Remove' on the box that appears.

- To change to 'add new redactions' mode, scroll to the bottom of the page. Click on the box icon, and your cursor will change into a crosshair. Now you can add new redaction boxes where you wish.

 ![Change redaction mode](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/change_review_mode.PNG)

- On the right of the screen there is a dropdown and table where you can filter to entity types that have been found throughout the document. You can choose a specific entity type to see which pages the entity is present on. If you want to go to the page specified in the table, you can click on a cell in the table and the review page will be changed to that page.

- ![Change redaction mode](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/list_find_labels.PNG)

- Note that the table currently only shows entity types, and not specific found text. So for instance if you provide a list of specific terms to redact in the [deny list](#deny-list-example), they will all be labelled just as 'CUSTOM'. A feature to include in the near term will include being able to view specific redacted text in this table to get a better sense of the PII entities found.

- Once you happy with your modified changes throughout the document, click 'Apply revised redactions' at the top of the page. The app will then run through all the pages in the document to update the redactions, and will output a modified PDF file. The modified PDF will appear at the top of the page in the file area. It will also output a revised '...review_file.csv' that you can then use for future review tasks.

 ![Review modified outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_mod_outputs.PNG)

- Any feedback or comments on the app, please get in touch!

- # ADVANCED USER GUIDE

- This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.

- ## Table of contents

- - [Modifying and merging redaction review files](#modifying-and-merging-redaction-review-files)
- - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
- - [Merging existing redaction review files](#merging-existing-redaction-review-files)
- - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
- - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
- - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
- - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
- - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)

- ## Modifying and merging redaction review files

- You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).

- As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.

- ### Modifying existing redaction review files
- If you open up a 'review_file' csv output using a spreadsheet software program such as Microsoft Excel you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadshet with just four suggested redactions (see below). The following instructions are for using Excel.

- ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)

- The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select delete on this menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number and so we wanted to change the label. Simply click on the relevant label cells, let's change it to 'SECURITY_NUMBER'. You could also use 'Finad & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.

- How about we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).

- Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.

- I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder that the original was found. Let's upload this file to the app along with the original pdf to see how the redactions look now.

- ![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)

- We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.

- ### Merging existing redaction review files

 Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly, especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.

@@ -303,6 +424,30 @@ When you click the 'convert .xfdf comment file to review_file.csv' button, the a
 ![Outputs from Adobe import](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/export_to_adobe/img/import_from_adobe_interface_outputs.PNG)

 ## Using AWS Textract and Comprehend when not running in an AWS environment

 AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
@@ -322,4 +467,27 @@ AWS_SECRET_KEY= your-secret-key
 The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.

- Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.

 # USER GUIDE

+ ## Experiment with the test (public) version of the app
+ You can test out many of the features described in this user guide at the [public test version of the app](https://huggingface.co/spaces/seanpedrickcase/document_redaction), which is free. AWS functions (e.g. Textract, Comprehend) are not enabled (unless you have valid API keys).
+
+ ## Chat over this user guide
+ You can now [speak with a chatbot about this user guide](https://huggingface.co/spaces/seanpedrickcase/Light-PDF-Web-QA-Chatbot) (beta!).
+
 ## Table of contents

 - [Example data files](#example-data-files)

 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)

 See the [advanced user guide here](#advanced-user-guide):
+ - [Merging redaction review files](#merging-redaction-review-files)
 - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
 - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
 - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
   - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
   - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
+ - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
 - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
+ - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)

 ## Example data files

+ Please try these example files to follow along with this guide:
 - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
 - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
 - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
+ - [Dummy case note data](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv)

 ## Basic redaction

+ The document redaction app can detect personally-identifiable information (PII) in documents. Documents can be redacted directly, or suggested redactions can be reviewed and modified using a graphical user interface. Basic document redaction can be performed quickly using the default options.

 Download the example PDFs above to your computer. Open up the redaction app with the link provided by email.

 ![Upload files](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/file_upload_highlight.PNG)

+ ### Upload files to the app
+
+ The 'Redact PDFs/images' tab currently accepts PDFs and image files (JPG, PNG) for redaction. Click on the 'Drop files here or Click to Upload' area of the screen, and select one of the three different [example files](#example-data-files) (they should all be stored in the same folder if you want them to be redacted at the same time).
+
+ ### Text extraction

+ First, select one of the three text extraction options:
+ - **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it on to the second option below.
+ - **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
+ - **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.
+
+ ### Optional - select signature extraction
+ If you chose the AWS Textract service above, you can choose whether you want handwriting and/or signatures redacted by default. Choosing signatures here has a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.
+
+ ![AWS Textract handwriting and signature options](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_handwriting_signatures.PNG)
+
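+ As a rough sketch of the arithmetic (assuming the per-1,000-page rates quoted above; your AWS pricing may differ):
+
+ ```python
+ # Rough Textract cost estimate based on the rates quoted in this guide:
+ # ~$3.50 per 1,000 pages with signature detection, ~$1.50 without.
+ def textract_cost_usd(num_pages: int, detect_signatures: bool) -> float:
+     rate_per_1000_pages = 3.50 if detect_signatures else 1.50
+     return num_pages * rate_per_1000_pages / 1000
+
+ print(textract_cost_usd(250, detect_signatures=True))   # 0.875
+ print(textract_cost_usd(250, detect_signatures=False))  # 0.375
+ ```
+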
+ ### PII redaction method

 If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
+ - **'Only extract text - (no redaction)'** - Choose this if you are only interested in getting the text out of the document for further processing (e.g. to find duplicate pages, or to review text on the Review redactions page).
+ - **'Local'** - This uses the spacy package to rapidly detect PII in extracted text. This method is often sufficient if you are just interested in redacting specific terms defined in a custom list.
+ - **'AWS Comprehend'** - This method calls an AWS service to provide more accurate identification of PII in extracted text.
+
+ ### Optional - costs and time estimation
+ If the option is enabled (by your system admin, in the config file), you will see a cost and time estimate for the redaction process. 'Existing Textract output file found' will be checked automatically if previous Textract text extraction files exist in the output folder, or have been [previously uploaded by the user](#additional-aws-textract-outputs) (saving time and money for redaction).
+
+ ![Cost and time estimation](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/costs_and_time.PNG)
+
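+ As a minimal sketch of what 'enabled in the config file' means in practice (the SHOW_COSTS variable name appears in the app's tools/config.py imports; the reading logic below is an assumption for illustration):
+
+ ```python
+ import os
+
+ # Config flags like this are read from environment variables / the config file.
+ SHOW_COSTS = os.environ.get("SHOW_COSTS", "False")
+
+ if SHOW_COSTS == "True":
+     print("Cost and time estimates will be shown before redaction")
+ ```
+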
+ ### Optional - cost code selection
+ If the option is enabled (by your system admin, in the config file), you may be prompted to select a cost code before continuing with the redaction task.
+
+ ![Cost code selection](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/cost_code_selection.PNG)
+
+ The relevant cost code can be found either by: 1. using the search bar above the data table to find relevant cost codes, then clicking on the relevant row, or 2. typing it directly into the dropdown to the right, where it should filter as you type.
+
+ ### Optional - Submit whole documents to Textract API
+ If this option is enabled (by your system admin, in the config file), you will have the option to submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than using the 'Redact document' process described below). This feature is described in more detail in the [advanced user guide](#using-the-aws-textract-document-api).
+
+ ![Textract document API](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_document_api.PNG)
+
+ ### Redact the document

+ Click 'Redact document'. After loading in the document, the app should be able to process about 30 pages per minute (depending on the redaction methods chosen above). When ready, you should see a message saying that processing is complete, with output files appearing in the bottom right.
+
+ ### Redaction outputs

 ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)

+ - **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
+ - **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest (e.g. using Excel or a similar program).
+ - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
+
+ ### Additional AWS Textract outputs
+
+ If you have used the AWS Textract option for extracting text, you may also see a '..._textract.json' file. This file contains all the relevant extracted text information that comes from the AWS Textract service. You can keep this file and upload it at a later date alongside your input document, which will let you skip calling AWS Textract every time you want to run a redaction task on that document, as follows:
+
+ ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)

+ ### Downloading output files from previous redaction tasks

+ If you are logged in via AWS Cognito and you lose your app page for some reason (e.g. from a crash or a reload), it is possible to recover your previous output files, provided the server has not been shut down since you redacted the document. Go to 'Redaction settings', then scroll to the bottom to see 'View all output files from this session'.

+ ![View all output files](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/view_all_output_files.PNG)
+
+ ### Basic redaction summary

 We have covered redacting documents with the default redaction options. The '...redacted.pdf' file output may be enough for your purposes. But it is very likely that you will need to customise your redaction options, which we will cover below.


 Using the above approaches to allow, deny, and full page redaction lists will give you an output [like this](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/Partnership-Agreement-Toolkit_0_0_redacted.pdf).

+ #### Adding to the loaded allow, deny, and whole page lists in-app
+
+ If you open the accordion below the allow list options called 'Manually modify custom allow...', you should be able to see a few tables with options to add new rows:
+
+ ![Manually modify allow or deny list](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify.PNG)
+
+ If the table is empty, you can add a new row by clicking on the '+' item below each table header. If there is existing data, you may need to click on the three dots to the right and select 'Add row below'. Type the item you wish to keep/remove in the cell, and then (importantly) press Enter to add this new item to the allow/deny/whole page list. Your output tables should look something like the below.
+
+ ![Manually modify allow or deny list filled](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify_filled.PNG)
+
 ### Redacting additional types of personal information

 You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?

 ## Handwriting and signature redaction

+ The file [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf) is provided as an example document to test AWS Textract + redaction with a document that has signatures in it. If you have access to AWS Textract in the app, try removing all entity types from redaction on the Redaction settings tab by clicking the big X to the right of 'Entities to redact'.
+
+ To ensure that handwriting and signatures are enabled (they are by default), on the front screen go to the 'AWS Textract signature detection' section to enable/disable the following options:

 ![Handwriting and signatures](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/textract_handwriting_signatures.PNG)

 ## Reviewing and modifying suggested redactions

+ Sometimes the app will suggest redactions that are incorrect, or will miss personal information entirely. The app allows you to review and modify suggested redactions to compensate for this. You can do this on the 'Review redactions' tab.
+
+ We will go through ways to review suggested redactions with an example. On the first tab, 'PDFs/images', upload the ['Example of files sent to a professor before applying.pdf'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf) file. Let's stick with the 'Local model - selectable text' option, and click 'Redact document'. Once the outputs are created, go to the 'Review redactions' tab.

+ On the 'Review redactions' tab you have a visual interface that allows you to inspect and modify redactions suggested by the app. There are quite a few options to look at, so we'll go from top to bottom.

 ![Review redactions](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_redactions.PNG)

+ ### Uploading documents for review
+
+ The top area has a file upload box where you can upload the original, unredacted PDF alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the 'Review PDF...' button to load in the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.
+
+ Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document.
+
+ ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
+
+ You can upload all three review files in the box (unredacted document, '..._review_file.csv' and '..._ocr_output.csv' file) before clicking 'Review PDF...', as in the image below:
+
+ ![Upload three files for review](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/upload_three_files.PNG)
+
+ **NOTE:** ensure you upload the ***unredacted*** document here and not the redacted version, otherwise you will be checking over a document that already has redaction boxes applied!
+
+ ### Page navigation
+
 You can change the page viewed either by clicking 'Previous page' or 'Next page', or by typing a specific page number in the 'Current page' box and pressing Enter on your keyboard. Each time you switch page, it will save redactions you have made on the page you are moving from, so you will not lose changes you have made.

+ You can also navigate to different pages by clicking on rows in the tables under 'Search suggested redactions' to the right, or 'Search all extracted text' (if enabled) beneath that.
+
+ ### The document viewer pane
+
+ On the selected page, each redaction is highlighted with a box next to its suggested redaction label (e.g. person, email).
+
+ ![Document view pane](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/document_viewer_pane.PNG)

+ The document viewer pane gives you a number of options for adding and modifying redaction boxes and for adjusting the page view. To zoom in and out of the page, use your mouse wheel. To move around the page while zoomed in, you need to be in modify mode. Scroll to the bottom of the document viewer to see the relevant controls. You should see a box icon, a hand icon, and two arrows pointing counter-clockwise and clockwise.

 ![Change redaction mode](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/change_review_mode.PNG)

+ Click on the hand icon to go into modify mode. You can then click and hold on the document viewer to move around the page when zoomed in. To rotate the page, click on either of the round arrow buttons to turn it in that direction.
+
+ **NOTE:** When you switch page, the viewer will stay in your selected orientation, so if it looks strange, just rotate the page again and hopefully it will look correct!
+
+ #### Modify existing redactions (hand icon)
+
+ After clicking on the hand icon, the interface allows you to modify existing redaction boxes. When in this mode, you can click and hold on an existing box to move it.
+
+ ![Modify existing redaction box](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/modify_existing_redaction_box.PNG)
+
+ Click on one of the small boxes at the edges to change the size of the box. To delete a box, click on it to highlight it, then press delete on your keyboard. Alternatively, double click on a box and click 'Remove' on the popup that appears.
+
+ ![Remove existing redaction box](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/existing_redaction_box_remove.PNG)
+
+ #### Add new redaction boxes (box icon)

+ To change to 'add redaction boxes' mode, scroll to the bottom of the page. Click on the box icon, and your cursor will change into a crosshair. Now you can add new redaction boxes where you wish. A popup will appear when you create a new box so that you can select a label and colour for it.

+ #### 'Locking in' new redaction box format

+ It is possible to lock in a chosen format for new redaction boxes so that you don't have the popup appearing each time. When you make a new box, select the options for your 'locked' format, and then click on the lock icon on the left side of the popup, which should turn blue.
+
+ ![Lock redaction box format](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/new_redaction_box_lock_mode.PNG)
+
+ You can now add new redaction boxes without a popup appearing. If you want to change or 'unlock' your chosen box format, you can click on the new icon that has appeared at the bottom of the document viewer pane that looks a little like a gift tag. You can then change the defaults, or click on the lock icon again to 'unlock' the new box format - popups will then appear again each time you create a new box.
+
+ ![Change or unlock redaction box format](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/change_review_mode_with_lock.PNG)
+
+ ### Apply redactions to PDF and Save changes on current page
+
+ Once you have reviewed all the redactions in your document and you are happy with the outputs, you can click 'Apply revised redactions to PDF' to create a new '_redacted.pdf' output alongside a new '_review_file.csv' output.
+
+ If you are working on a page and haven't saved for a while, you can click 'Save changes on current page to file' to ensure that they are saved to an updated 'review_file.csv' output.

 ![Review modified outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_mod_outputs.PNG)

+ ### Selecting and removing redaction boxes using the 'Search suggested redactions' table

+ The table shows a list of all the suggested redactions in the document alongside the page, label, and text (if available).

+ ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/list_find_labels.PNG)

+ If you click on one of the rows in this table, you will be taken to the page of the redaction. Clicking on a redaction row on the same page *should* change the colour of the redaction box to blue to help you locate it in the document viewer (just in the app, not in redaction output PDFs).

+ ![Highlighted redaction row](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_row_highlight.PNG)

+ You can choose a specific entity type to see which pages the entity is present on. If you want to go to the page specified in the table, you can click on a cell in the table and the review page will be changed to that page.

+ To filter the 'Search suggested redactions' table you can:
+ 1. Click on one of the dropdowns (Redaction category, Page, Text), and select an option, or
+ 2. Write text in the 'Filter' box just above the table. Click the blue box to apply the filter to the table.

+ Once you have filtered the table, you have a few options underneath for what you can do with the filtered rows:

+ - Click the 'Exclude specific row from redactions' button to remove only the redaction from the last row you clicked on.
+ - Click the 'Exclude all items in table from redactions' button to remove all redactions visible in the table from the document.

+ **NOTE**: After excluding redactions using either of the above options, click the 'Reset filters' button below to ensure that the dropdowns and table return to showing all remaining redactions in the document.

+ If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (only the last action can be undone).

+ ### Navigating through the document using the 'Search all extracted text' table

+ The 'Search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).

+ You can navigate through the document using this table. When you click on a row, the document viewer pane to the left will change to the selected page.

+ ![Select extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/select_extracted_text.PNG)

+ You can search through the extracted text using the search bar just above the table, which should filter as you type. To apply the filter and 'cut' the table, click on the blue tick inside the box next to your search term. To return the table to its original content, click the 'Reset OCR output table filter' button below the table.
+
+ ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
+
+ # ADVANCED USER GUIDE
+
+ This advanced user guide will go over some of the features recently added to the app, including: merging and modifying redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.

+ ## Table of contents

+ - [Merging redaction review files](#merging-redaction-review-files)
+ - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
+ - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
+ - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
+   - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
+   - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
+ - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
+ - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
+ - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)

+ ## Merging redaction review files

 Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly, especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.
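
+ If you prefer to script this step outside the app, a rough sketch with pandas (the filenames are placeholders, and the exact column contents depend on your own review files):
+
+ ```python
+ import pandas as pd
+
+ # Combine several review files for the same document into one 'merged' file.
+ review_files = ["doc_review_file_local.csv", "doc_review_file_textract.csv"]
+ merged = pd.concat([pd.read_csv(f) for f in review_files], ignore_index=True)
+
+ # Drop redaction boxes that were suggested by more than one redaction run
+ merged = merged.drop_duplicates()
+ merged.to_csv("doc_review_file_merged.csv", index=False)
+ ```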

 ![Outputs from Adobe import](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/export_to_adobe/img/import_from_adobe_interface_outputs.PNG)

+ ## Using the AWS Textract document API
+
+ This option can be enabled by your system admin, in the config file (the 'SHOW_BULK_TEXTRACT_CALL_OPTIONS' environment variable, and subsequent variables). Using this, you will have the option to submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than using the 'Redact document' process described earlier).
+
+ ### Starting a new Textract API job
+
+ To use this feature, first upload a document file in the file input box [in the usual way](#upload-files-to-the-app) on the first tab of the app. Under AWS Textract signature detection you can select whether or not you would like to analyse signatures (with a [cost implication](#optional---select-signature-extraction)).
+
+ Then, open the section under the heading 'Submit whole document to AWS Textract API...'.
+
+ ![Textract document API menu](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_document_api.PNG)
+
+ Click 'Analyse document with AWS Textract API call'. After a few seconds, the job should be submitted to the AWS Textract service. The box 'Job ID to check status' should now have an ID filled in, and the table should have a row added with details of the new API job (alongside any previous jobs from the last seven days).
+
+ Click the button underneath, 'Check status of Textract job and download', to see progress on the job. Processing will continue in the background until the job is ready, so it is worth periodically clicking this button to see if the outputs are ready. In testing, and as a rough estimate, this process seems to take about five seconds per page; however, it has not been tested with very large documents. Once ready, the '_textract.json' output should appear below.
+
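+ For the curious, an asynchronous whole-document Textract job looks roughly like this with boto3 (a sketch only - the app's own wrapper, bucket, and file names will differ):
+
+ ```python
+ import boto3
+
+ textract = boto3.client("textract")
+
+ # Start an asynchronous text detection job on a document already in S3
+ job = textract.start_document_text_detection(
+     DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "input/my_doc.pdf"}}
+ )
+
+ # Poll for the result - this mirrors the 'Check status of Textract job' button
+ result = textract.get_document_text_detection(JobId=job["JobId"])
+ print(result["JobStatus"])  # 'IN_PROGRESS' until ready, then 'SUCCEEDED'
+ ```
+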
+ ### Textract API job outputs
+
+ The '_textract.json' output can be used to speed up further redaction tasks as [described previously](#optional---costs-and-time-estimation); the 'Existing Textract output file found' flag should now be ticked.
+
+ ![Textract document API initial outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/textract_api/textract_api_initial_outputs.PNG)
+
+ You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking on the button 'Convert Textract job outputs to OCR results'. This file can then be used e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.
+
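+ If you ever need to do this conversion by hand, a minimal sketch (assuming the '_textract.json' holds a raw Textract response with the standard 'Blocks' list; the app's own '..._ocr_output.csv' columns may differ):
+
+ ```python
+ import csv
+ import json
+
+ # Pull page/line text out of a raw Textract response (filenames are placeholders)
+ with open("my_doc_textract.json") as f:
+     blocks = json.load(f).get("Blocks", [])
+
+ with open("my_doc_ocr_output.csv", "w", newline="") as f:
+     writer = csv.writer(f)
+     writer.writerow(["page", "text"])
+     for block in blocks:
+         if block.get("BlockType") == "LINE":
+             writer.writerow([block.get("Page", 1), block["Text"]])
+ ```
+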
 ## Using AWS Textract and Comprehend when not running in an AWS environment

 AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.

 The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
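
+ As a sketch of what 'picking up the keys' amounts to (illustrative only - the environment variable names follow the AWS_ACCESS_KEY / AWS_SECRET_KEY entries mentioned above, and the region is a placeholder):
+
+ ```python
+ import os
+
+ import boto3
+
+ # Build a Comprehend client from the keys set in the app's .env/config file
+ comprehend = boto3.client(
+     "comprehend",
+     region_name="eu-west-2",  # placeholder region
+     aws_access_key_id=os.environ["AWS_ACCESS_KEY"],
+     aws_secret_access_key=os.environ["AWS_SECRET_KEY"],
+ )
+
+ response = comprehend.detect_pii_entities(Text="My name is Jane Doe", LanguageCode="en")
+ print(response["Entities"])
+ ```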

+ Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
+
+ ## Modifying and merging redaction review files
+
+ You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
+
+ As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.
+
+ ### Modifying existing redaction review files
+ If you open up a 'review_file' csv output using spreadsheet software such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for using Excel.
+
+ ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
+
+ The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select Delete from the menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number, and so we wanted to change the label. Simply click on the relevant label cell and change it to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.
+
+ What if we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).
+
+ Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.
+
+ I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder as the original. Let's upload this file to the app along with the original pdf to see how the redactions look now.
+
+ ![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)
+
+ We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.
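+
+ The same edits can also be scripted if you have many review files to fix up. A rough pandas sketch (the column and label names - 'text', 'label', 'color', 'ymin', 'ymax', 'PHONE_NUMBER', 'EMAIL_ADDRESS' - are assumptions for illustration; check your own review file's header row):
+
+ ```python
+ import pandas as pd
+
+ df = pd.read_csv("Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv")
+
+ # Drop the spurious 'et' person row (assumes a 'text' column holds the matched text)
+ df = df[df["text"] != "et"]
+
+ # Relabel the phone number, recolour the email address, and enlarge its box
+ df.loc[df["label"] == "PHONE_NUMBER", "label"] = "SECURITY_NUMBER"
+ df.loc[df["label"] == "EMAIL_ADDRESS", "color"] = "(0, 0, 255)"  # pure blue
+ df.loc[df["label"] == "EMAIL_ADDRESS", "ymin"] -= 5
+ df.loc[df["label"] == "EMAIL_ADDRESS", "ymax"] += 5
+
+ df.to_csv("Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv", index=False)
+ ```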
app.py CHANGED
@@ -4,11 +4,11 @@ import pandas as pd
 import gradio as gr
 from gradio_image_annotation import image_annotator

- from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_BULK_TEXTRACT_CALL_OPTIONS, TEXTRACT_BULK_ANALYSIS_BUCKET, TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH
 from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select
 from tools.aws_functions import upload_file_to_s3, download_file_from_s3
 from tools.file_redaction import choose_and_run_redactor
- from tools.file_conversion import prepare_image_or_pdf, get_input_file_names, convert_review_df_to_annotation_json
 from tools.redaction_review import apply_redactions_to_review_df_and_files, update_all_page_annotation_object_based_on_previous_page, decrease_page, increase_page, update_annotator_object_and_filter_df, update_entities_df_recogniser_entities, update_entities_df_page, update_entities_df_text, df_select_callback, convert_df_to_xfdf, convert_xfdf_to_dataframe, reset_dropdowns, exclude_selected_items_from_redaction, undo_last_removal, update_selected_review_df_row_colour, update_all_entity_df_dropdowns, df_select_callback_cost, update_other_annotator_number_from_current, update_annotator_page_from_review_df, df_select_callback_ocr, df_select_callback_textract_api
 from tools.data_anonymise import anonymise_data_files
 from tools.auth import authenticate_user
@@ -572,9 +572,9 @@ with app:
     text_entity_dropdown.select(update_entities_df_text, inputs=[text_entity_dropdown, recogniser_entity_dataframe_base, recogniser_entity_dropdown, page_entity_dropdown], outputs=[recogniser_entity_dataframe, recogniser_entity_dropdown, page_entity_dropdown])

     # Clicking on a cell in the recogniser entity dataframe will take you to that page, and also highlight the target redaction box in blue
-     recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[annotate_current_page, selected_entity_dataframe_row]).\
-     success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour, page_sizes], outputs=[review_file_state, selected_entity_id, selected_entity_colour]).\
-     success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes, annotate_current_page, annotate_previous_page, all_image_annotations_state, annotator], outputs=[annotator, all_image_annotations_state])

     reset_dropdowns_btn.click(reset_dropdowns, inputs=[recogniser_entity_dataframe_base], outputs=[recogniser_entity_dropdown, text_entity_dropdown, page_entity_dropdown]).\
     success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
@@ -733,7 +733,7 @@
 if __name__ == "__main__":
     if RUN_DIRECT_MODE == "0":

-         if os.environ['COGNITO_AUTH'] == "1":
             app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, auth=authenticate_user, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
         else:
             app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
4
  import gradio as gr
5
  from gradio_image_annotation import image_annotator
6
 
7
+ from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_BULK_TEXTRACT_CALL_OPTIONS, TEXTRACT_BULK_ANALYSIS_BUCKET, TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH, COGNITO_AUTH
8
  from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select
9
  from tools.aws_functions import upload_file_to_s3, download_file_from_s3
10
  from tools.file_redaction import choose_and_run_redactor
11
+ from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
12
  from tools.redaction_review import apply_redactions_to_review_df_and_files, update_all_page_annotation_object_based_on_previous_page, decrease_page, increase_page, update_annotator_object_and_filter_df, update_entities_df_recogniser_entities, update_entities_df_page, update_entities_df_text, df_select_callback, convert_df_to_xfdf, convert_xfdf_to_dataframe, reset_dropdowns, exclude_selected_items_from_redaction, undo_last_removal, update_selected_review_df_row_colour, update_all_entity_df_dropdowns, df_select_callback_cost, update_other_annotator_number_from_current, update_annotator_page_from_review_df, df_select_callback_ocr, df_select_callback_textract_api
13
  from tools.data_anonymise import anonymise_data_files
14
  from tools.auth import authenticate_user
 
572
  text_entity_dropdown.select(update_entities_df_text, inputs=[text_entity_dropdown, recogniser_entity_dataframe_base, recogniser_entity_dropdown, page_entity_dropdown], outputs=[recogniser_entity_dataframe, recogniser_entity_dropdown, page_entity_dropdown])
573
 
574
  # Clicking on a cell in the recogniser entity dataframe will take you to that page, and also highlight the target redaction box in blue
575
+ recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[selected_entity_dataframe_row]).\
576
+ success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour], outputs=[review_file_state, selected_entity_id, selected_entity_colour]).\
577
+ success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes, all_image_annotations_state, annotator, selected_entity_dataframe_row, input_folder_textbox, doc_full_file_name_textbox], outputs=[annotator, all_image_annotations_state, annotate_current_page, page_sizes, review_file_state, annotate_previous_page])
578
 
579
  reset_dropdowns_btn.click(reset_dropdowns, inputs=[recogniser_entity_dataframe_base], outputs=[recogniser_entity_dropdown, text_entity_dropdown, page_entity_dropdown]).\
580
  success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
 
733
  if __name__ == "__main__":
734
  if RUN_DIRECT_MODE == "0":
735
 
736
+ if COGNITO_AUTH == "1":
737
  app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, auth=authenticate_user, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
738
  else:
739
  app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
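
The `COGNITO_AUTH` check above now uses the value imported from `tools.config` rather than `os.environ['COGNITO_AUTH']`, which raises a `KeyError` whenever the variable was never exported. A minimal sketch of the difference, assuming a `get_or_create_env_var` helper along the lines of the one used throughout `tools/config.py` (its actual implementation is not shown in this diff):

```python
# Sketch only: direct os.environ access vs. a config constant with a default.
import os

# Old approach: raises KeyError if COGNITO_AUTH was never exported.
# cognito_auth = os.environ['COGNITO_AUTH']

# New approach: the config module guarantees a value exists.
def get_or_create_env_var(var_name: str, default_value: str) -> str:
    value = os.environ.get(var_name)
    if value is None:
        os.environ[var_name] = default_value  # persist the default for later reads
        value = default_value
    return value

COGNITO_AUTH = get_or_create_env_var('COGNITO_AUTH', '0')

if COGNITO_AUTH == "1":
    print("Launching with Cognito authentication")
else:
    print("Launching without authentication")
```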
tools/config.py CHANGED
@@ -237,7 +237,7 @@ else: OUTPUT_ALLOW_LIST_PATH = 'config/default_allow_list.csv'
237
 
238
  SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')
239
 
240
- GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'False')
241
 
242
  DEFAULT_COST_CODE = get_or_create_env_var('DEFAULT_COST_CODE', '')
243
 
@@ -246,7 +246,7 @@ COST_CODES_PATH = get_or_create_env_var('COST_CODES_PATH', '') # 'config/COST_CE
246
  S3_COST_CODES_PATH = get_or_create_env_var('S3_COST_CODES_PATH', '') # COST_CENTRES.csv # This is a path within the DOCUMENT_REDACTION_BUCKET
247
 
248
  if COST_CODES_PATH: OUTPUT_COST_CODES_PATH = COST_CODES_PATH
249
- else: OUTPUT_COST_CODES_PATH = 'config/COST_CENTRES.csv'
250
 
251
  ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False') # If you have cost codes listed, is it compulsory to choose one before redacting?
252
 
 
237
 
238
  SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')
239
 
240
+ GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'True')
241
 
242
  DEFAULT_COST_CODE = get_or_create_env_var('DEFAULT_COST_CODE', '')
243
 
 
246
  S3_COST_CODES_PATH = get_or_create_env_var('S3_COST_CODES_PATH', '') # COST_CENTRES.csv # This is a path within the DOCUMENT_REDACTION_BUCKET
247
 
248
  if COST_CODES_PATH: OUTPUT_COST_CODES_PATH = COST_CODES_PATH
249
+ else: OUTPUT_COST_CODES_PATH = ''
250
 
251
  ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False') # If you have cost codes listed, is it compulsory to choose one before redacting?
252
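
With `GET_COST_CODES` now defaulting to `'True'` and `OUTPUT_COST_CODES_PATH` falling back to an empty string instead of a hard-coded CSV path, downstream code presumably has to treat `''` as "no local cost codes file configured". A hedged sketch of such a guard (the real consumer lives elsewhere in the app; `load_cost_codes` is a hypothetical name):

```python
import pandas as pd

def load_cost_codes(path: str) -> pd.DataFrame:
    """Return cost codes from CSV, or an empty frame when no path is configured."""
    if not path:  # OUTPUT_COST_CODES_PATH may now be ''
        return pd.DataFrame(columns=["cost_code"])
    return pd.read_csv(path)
```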
 
tools/file_conversion.py CHANGED
@@ -21,6 +21,7 @@ from PIL import Image
21
  from scipy.spatial import cKDTree
22
  import random
23
  import string
 
24
 
25
  IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
26
 
@@ -617,11 +618,10 @@ def prepare_image_or_pdf(
617
 
618
  elif file_extension in ['.csv']:
619
  if '_review_file' in file_path_without_ext:
620
- #print("file_path:", file_path)
621
  review_file_csv = read_file(file_path)
622
  all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
623
  json_from_csv = True
624
- print("Converted CSV review file to image annotation object")
625
  elif '_ocr_output' in file_path_without_ext:
626
  all_line_level_ocr_results_df = read_file(file_path)
627
  json_from_csv = False
@@ -850,121 +850,246 @@ def remove_duplicate_images_with_blank_boxes(data: List[dict]) -> List[dict]:
850
 
851
  return result
852
 
853
- def divide_coordinates_by_page_sizes(review_file_df:pd.DataFrame, page_sizes_df:pd.DataFrame, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"):
854
-
855
- '''Convert data to the same coordinate system. If all coordinates are greater than one, these are absolute image coordinates - convert back to relative coordinates.'''
856
-
857
- review_file_df_out = review_file_df
858
-
859
- if xmin in review_file_df.columns and not review_file_df.empty:
860
- coord_cols = [xmin, xmax, ymin, ymax]
861
- for col in coord_cols:
862
- review_file_df.loc[:, col] = pd.to_numeric(review_file_df[col], errors="coerce")
863
-
864
- review_file_df_orig = review_file_df.copy().loc[(review_file_df[xmin] <= 1) & (review_file_df[xmax] <= 1) & (review_file_df[ymin] <= 1) & (review_file_df[ymax] <= 1),:]
865
-
866
- #print("review_file_df_orig:", review_file_df_orig)
867
-
868
- review_file_df_div = review_file_df.loc[(review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) & (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1),:]
869
-
870
- #print("review_file_df_div:", review_file_df_div)
871
-
872
- review_file_df_div.loc[:, "page"] = pd.to_numeric(review_file_df_div["page"], errors="coerce")
873
 
874
- if "image_width" not in review_file_df_div.columns and not page_sizes_df.empty:
 
875
 
876
- page_sizes_df["image_width"] = page_sizes_df["image_width"].replace("<NA>", pd.NA)
877
- page_sizes_df["image_height"] = page_sizes_df["image_height"].replace("<NA>", pd.NA)
878
- review_file_df_div = review_file_df_div.merge(page_sizes_df[["page", "image_width", "image_height", "mediabox_width", "mediabox_height"]], on="page", how="left")
 
 
879
 
880
- if "image_width" in review_file_df_div.columns:
881
- if review_file_df_div["image_width"].isna().all(): # Check if all are NaN values. If so, assume we only have mediabox coordinates available
882
- review_file_df_div["image_width"] = review_file_df_div["image_width"].fillna(review_file_df_div["mediabox_width"]).infer_objects()
883
- review_file_df_div["image_height"] = review_file_df_div["image_height"].fillna(review_file_df_div["mediabox_height"]).infer_objects()
 
884
 
885
- convert_type_cols = ["image_width", "image_height", xmin, xmax, ymin, ymax]
886
- review_file_df_div[convert_type_cols] = review_file_df_div[convert_type_cols].apply(pd.to_numeric, errors="coerce")
 
 
887
 
888
- review_file_df_div[xmin] = review_file_df_div[xmin] / review_file_df_div["image_width"]
889
- review_file_df_div[xmax] = review_file_df_div[xmax] / review_file_df_div["image_width"]
890
- review_file_df_div[ymin] = review_file_df_div[ymin] / review_file_df_div["image_height"]
891
- review_file_df_div[ymax] = review_file_df_div[ymax] / review_file_df_div["image_height"]
892
 
893
- # Concatenate the original and modified DataFrames
894
- dfs_to_concat = [df for df in [review_file_df_orig, review_file_df_div] if not df.empty]
895
- if dfs_to_concat: # Ensure there's at least one non-empty DataFrame
896
- review_file_df_out = pd.concat(dfs_to_concat)
897
  else:
898
- review_file_df_out = review_file_df # Return an original DataFrame instead of raising an error
899
 
900
- # Only sort if the DataFrame is not empty and contains the required columns
901
- required_sort_columns = {"page", xmin, ymin}
902
- if not review_file_df_out.empty and required_sort_columns.issubset(review_file_df_out.columns):
903
- review_file_df_out.sort_values(["page", ymin, xmin], inplace=True)
904
 
905
- review_file_df_out.drop(["image_width", "image_height", "mediabox_width", "mediabox_height"], axis=1, errors="ignore")
 
906
 
907
- return review_file_df_out
908
 
909
- def multiply_coordinates_by_page_sizes(review_file_df: pd.DataFrame, page_sizes_df: pd.DataFrame, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"):
910
911
 
912
- if xmin in review_file_df.columns and not review_file_df.empty:
913
 
914
- coord_cols = [xmin, xmax, ymin, ymax]
915
- for col in coord_cols:
916
- review_file_df.loc[:, col] = pd.to_numeric(review_file_df[col], errors="coerce")
 
917
 
918
- # Separate absolute vs relative coordinates
919
- review_file_df_orig = review_file_df.loc[
920
- (review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) &
921
- (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1), :].copy()
922
 
923
- review_file_df = review_file_df.loc[
924
- (review_file_df[xmin] <= 1) & (review_file_df[xmax] <= 1) &
925
- (review_file_df[ymin] <= 1) & (review_file_df[ymax] <= 1), :].copy()
 
 
 
 
926
 
927
- if review_file_df.empty:
928
- return review_file_df_orig # If nothing is left, return the original absolute-coordinates DataFrame
929
 
930
- review_file_df.loc[:, "page"] = pd.to_numeric(review_file_df["page"], errors="coerce")
931
 
932
- if "image_width" not in review_file_df.columns and not page_sizes_df.empty:
933
- page_sizes_df[['image_width', 'image_height']] = page_sizes_df[['image_width','image_height']].replace("<NA>", pd.NA) # Ensure proper NA handling
934
- review_file_df = review_file_df.merge(page_sizes_df, on="page", how="left")
 
935
 
936
- if "image_width" in review_file_df.columns:
937
- # Split into rows with/without image size info
938
- review_file_df_not_na = review_file_df.loc[review_file_df["image_width"].notna()].copy()
939
- review_file_df_na = review_file_df.loc[review_file_df["image_width"].isna()].copy()
940
 
941
- if not review_file_df_not_na.empty:
942
- convert_type_cols = ["image_width", "image_height", xmin, xmax, ymin, ymax]
943
- review_file_df_not_na[convert_type_cols] = review_file_df_not_na[convert_type_cols].apply(pd.to_numeric, errors="coerce")
 
944
 
945
- # Multiply coordinates by image sizes
946
- review_file_df_not_na[xmin] *= review_file_df_not_na["image_width"]
947
- review_file_df_not_na[xmax] *= review_file_df_not_na["image_width"]
948
- review_file_df_not_na[ymin] *= review_file_df_not_na["image_height"]
949
- review_file_df_not_na[ymax] *= review_file_df_not_na["image_height"]
950
 
951
- # Concatenate the modified and unmodified data
952
- review_file_df = pd.concat([df for df in [review_file_df_not_na, review_file_df_na] if not df.empty])
 
953
 
954
- # Merge with the original absolute-coordinates DataFrame
955
- dfs_to_concat = [df for df in [review_file_df_orig, review_file_df] if not df.empty]
956
- if dfs_to_concat: # Ensure there's at least one non-empty DataFrame
957
- review_file_df = pd.concat(dfs_to_concat)
958
- else:
959
- review_file_df = pd.DataFrame() # Return an empty DataFrame instead of raising an error
960
 
961
- # Only sort if the DataFrame is not empty and contains the required columns
962
- required_sort_columns = {"page", "xmin", "ymin"}
963
- if not review_file_df.empty and required_sort_columns.issubset(review_file_df.columns):
964
- review_file_df.sort_values(["page", "xmin", "ymin"], inplace=True)
965
 
966
- return review_file_df
 
 
 
 
967
 
 
968
 
969
  def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
970
  '''
@@ -1018,7 +1143,6 @@ def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
1018
 
1019
  return merged_df
1020
 
1021
-
1022
  def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
1023
  '''
1024
  Match text from one dataframe to another based on proximity matching of coordinates across all pages.
@@ -1142,12 +1266,12 @@ def convert_annotation_data_to_dataframe(all_annotations: List[Dict[str, Any]]):
1142
  # prevents this from being necessary.
1143
 
1144
  # 7. Ensure essential columns exist and set column order
1145
- essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id"]
1146
  for col in essential_box_cols:
1147
  if col not in final_df.columns:
1148
  final_df[col] = pd.NA # Add column with NA if it wasn't present in any box
1149
 
1150
- base_cols = ["image", "page"]
1151
  extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
1152
  final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)
1153
 
@@ -1185,7 +1309,8 @@ def create_annotation_dicts_from_annotation_df(
1185
  available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]
1186
 
1187
  if 'text' in all_image_annotations_df.columns:
1188
- all_image_annotations_df.loc[all_image_annotations_df['text'].isnull(), 'text'] = ''
 
1189
 
1190
  if not available_cols:
1191
  print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")
@@ -1226,85 +1351,84 @@ def create_annotation_dicts_from_annotation_df(
1226
 
1227
  return result
1228
 
1229
- def convert_annotation_json_to_review_df(all_annotations: List[dict],
1230
- redaction_decision_output: pd.DataFrame = pd.DataFrame(),
1231
- page_sizes: List[dict] = [],
1232
- do_proximity_match: bool = True) -> pd.DataFrame:
 
 
1233
  '''
1234
  Convert the annotation json data to a dataframe format.
1235
  Add on any text from the initial review_file dataframe by joining based on 'id' if available
1236
  in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
 
 
 
1237
  '''
1238
 
1239
  # 1. Convert annotations to DataFrame
1240
- # Ensure convert_annotation_data_to_dataframe populates the 'id' column
1241
- # if 'id' exists in the dictionaries within all_annotations.
1242
-
1243
  review_file_df = convert_annotation_data_to_dataframe(all_annotations)
1244
 
1245
- # Only keep rows in review_df where there are coordinates
1246
- review_file_df.dropna(subset='xmin', axis=0, inplace=True)
 
1247
 
1248
  # Exit early if the initial conversion results in an empty DataFrame
1249
  if review_file_df.empty:
1250
  # Define standard columns for an empty return DataFrame
1251
- check_columns = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"]
1252
- # Ensure 'id' is included if it might have been expected
1253
- return pd.DataFrame(columns=[col for col in check_columns if col != 'id' or 'id' in review_file_df.columns])
1254
-
1255
- # 2. Handle page sizes if provided
1256
- if not page_sizes:
1257
- page_sizes_df = pd.DataFrame(page_sizes) # Ensure it's a DataFrame
1258
- # Safely convert page column to numeric
1259
- page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
1260
- page_sizes_df.dropna(subset=["page"], inplace=True) # Drop rows where conversion failed
1261
- page_sizes_df["page"] = page_sizes_df["page"].astype(int) # Convert to int after handling errors/NaNs
 
 
 
1262
 
1263
 
1264
- # Apply coordinate division if page_sizes_df is not empty after processing
 
 
 
 
1265
  if not page_sizes_df.empty:
1266
- # Ensure 'page' column in review_file_df is numeric for merging
1267
- if 'page' in review_file_df.columns:
1268
- review_file_df['page'] = pd.to_numeric(review_file_df['page'], errors='coerce')
1269
- # Drop rows with invalid pages before division
1270
- review_file_df.dropna(subset=['page'], inplace=True)
1271
- review_file_df['page'] = review_file_df['page'].astype(int)
1272
- review_file_df = divide_coordinates_by_page_sizes(review_file_df, page_sizes_df)
1273
-
1274
- print("review_file_df after coord divide:", review_file_df)
1275
-
1276
- # Also apply to redaction_decision_output if it's not empty and has page numbers
1277
- if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
1278
- redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce')
1279
- # Drop rows with invalid pages before division
1280
- redaction_decision_output.dropna(subset=['page'], inplace=True)
1281
- redaction_decision_output['page'] = redaction_decision_output['page'].astype(int)
1282
- redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
1283
-
1284
- print("redaction_decision_output after coord divide:", redaction_decision_output)
1285
- else:
1286
- print("Warning: Page sizes DataFrame became empty after processing, skipping coordinate division.")
1287
 
1288
 
1289
  # 3. Join additional data from redaction_decision_output if provided
 
 
1290
  if not redaction_decision_output.empty:
1291
- # --- NEW LOGIC: Prioritize joining by 'id' ---
1292
- id_col_exists_in_review = 'id' in review_file_df.columns
1293
- id_col_exists_in_redaction = 'id' in redaction_decision_output.columns
1294
- joined_by_id = False # Flag to track if ID join was successful
 
 
1295
 
1296
  if id_col_exists_in_review and id_col_exists_in_redaction:
1297
  #print("Attempting to join data based on 'id' column.")
1298
  try:
1299
- # Ensure 'id' columns are of compatible types (e.g., string) to avoid merge errors
1300
  review_file_df['id'] = review_file_df['id'].astype(str)
1301
- # Make a copy to avoid SettingWithCopyWarning if redaction_decision_output is used elsewhere
 
1302
  redaction_copy = redaction_decision_output.copy()
1303
  redaction_copy['id'] = redaction_copy['id'].astype(str)
1304
 
1305
- # Select columns to merge from redaction output.
1306
- # Primarily interested in 'text', but keep 'id' for the merge key.
1307
- # Add other columns from redaction_copy if needed.
1308
  cols_to_merge = ['id']
1309
  if 'text' in redaction_copy.columns:
1310
  cols_to_merge.append('text')
@@ -1312,83 +1436,128 @@ def convert_annotation_json_to_review_df(all_annotations: List[dict],
1312
  print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")
1313
 
1314
  # Perform a left merge to keep all annotations and add matching text
1315
- # Suffixes prevent collision if 'text' already exists and we want to compare/choose
1316
- original_cols = review_file_df.columns.tolist()
 
 
1317
  merged_df = pd.merge(
1318
  review_file_df,
1319
  redaction_copy[cols_to_merge],
1320
  on='id',
1321
  how='left',
1322
- suffixes=('', '_redaction') # Suffix applied to columns from right df if names clash
1323
  )
1324
 
1325
- # Update the original 'text' column. Prioritize text from redaction output.
1326
- # If redaction output had 'text', a 'text_redaction' column now exists.
1327
- if 'text_redaction' in merged_df.columns:
1328
- if 'text' not in merged_df.columns: # If review_file_df didn't have text initially
1329
- merged_df['text'] = merged_df['text_redaction']
1330
- else:
1331
- # Use text from redaction where available, otherwise keep original text
1332
- merged_df['text'] = merged_df['text_redaction'].combine_first(merged_df['text'])
1333
-
1334
- # Remove the temporary column
1335
- merged_df = merged_df.drop(columns=['text_redaction'])
1336
 
1337
- # Ensure final columns match original expectation + potentially new 'text'
1338
- final_cols = original_cols
1339
- if 'text' not in final_cols and 'text' in merged_df.columns:
1340
- final_cols.append('text') # Make sure text column is kept if newly added
1341
- # Reorder/select columns if necessary, ensuring 'id' is kept
1342
- review_file_df = merged_df[[col for col in final_cols if col in merged_df.columns] + (['id'] if 'id' not in final_cols else [])]
1343
 
 
1344
 
1345
- #print("Successfully joined data using 'id'.")
1346
- joined_by_id = True
1347
 
1348
  except Exception as e:
1349
- print(f"Error during 'id'-based merge: {e}. Falling back to proximity match if enabled.")
1350
- # Fall through to proximity match below if an error occurred
1351
-
1352
- # --- Fallback to proximity match ---
1353
- if not joined_by_id and do_proximity_match:
1354
- if not id_col_exists_in_review or not id_col_exists_in_redaction:
1355
- print("Could not join by 'id' (column missing in one or both sources).")
1356
- print("Performing proximity match to add text data.")
1357
- # Match text to review file using proximity
1358
-
1359
- review_file_df = do_proximity_match_all_pages_for_text(df1=review_file_df.copy(), df2=redaction_decision_output.copy())
1360
- elif not joined_by_id and not do_proximity_match:
1361
- print("Skipping joining text data (ID join not possible, proximity match disabled).")
1362
- # --- End of join logic ---
1363
-
1364
- # 4. Ensure required columns exist, filling with blank if they don't
1365
- # Define base required columns, 'id' might or might not be present initially
1366
- required_columns = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text"]
1367
- # Add 'id' to required list if it exists in the dataframe at this point
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1368
  if 'id' in review_file_df.columns:
1369
- required_columns.append('id')
 
 
1370
 
1371
- for col in required_columns:
 
1372
  if col not in review_file_df.columns:
1373
- # Decide default value based on column type (e.g., '' for text, np.nan for numeric?)
1374
- # Using '' for simplicity here.
1375
- review_file_df[col] = ''
1376
 
1377
  # Select and order the final set of columns
1378
- review_file_df = review_file_df[required_columns]
 
 
1379
 
1380
  # 5. Final processing and sorting
1381
- # If colours are saved as list, convert to tuple
1382
  if 'color' in review_file_df.columns:
1383
- review_file_df["color"] = review_file_df["color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
 
 
1384
 
1385
  # Sort the results
1386
- sort_columns = ['page', 'ymin', 'xmin', 'label']
1387
  # Ensure sort columns exist before sorting
 
1388
  valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
1389
- if valid_sort_columns:
1390
- review_file_df = review_file_df.sort_values(valid_sort_columns)
1391
-
 
 
 
 
 
 
 
1392
  return review_file_df
1393
 
1394
  def fill_missing_box_ids(data_input: dict) -> dict:
@@ -1472,20 +1641,18 @@ def fill_missing_box_ids(data_input: dict) -> dict:
1472
 
1473
  def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
1474
  """
1475
- Generates unique alphanumeric IDs for rows in a DataFrame column
1476
- where the value is missing (NaN, None) or an empty string.
1477
 
1478
  Args:
1479
  df (pd.DataFrame): The input Pandas DataFrame.
1480
  column_name (str): The name of the column to check and fill (defaults to 'id').
1481
  This column will be added if it doesn't exist.
1482
  length (int): The desired length of the generated IDs (defaults to 12).
1483
- Cannot exceed the limits that guarantee uniqueness based
1484
- on the number of IDs needed and character set size.
1485
 
1486
  Returns:
1487
  pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
1488
- Note: The function modifies the DataFrame in place.
1489
  """
1490
 
1491
  # --- Input Validation ---
@@ -1497,43 +1664,59 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
1497
  raise ValueError("'length' must be a positive integer.")
1498
 
1499
  # --- Ensure Column Exists ---
 
1500
  if column_name not in df.columns:
1501
  print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
1502
- df[column_name] = np.nan # Initialize with NaN
 
 
 
 
 
1503
 
1504
  # --- Identify Rows Needing IDs ---
1505
- # Check for NaN, None, or empty strings ('')
1506
- # Convert to string temporarily for robust empty string check, handle potential errors
1507
- try:
1508
- df[column_name] = df[column_name].astype(str) #handles NaN/None conversion, .str.strip() removes whitespace
1509
- is_missing_or_empty = (
1510
- df[column_name].isna()
1511
- #| (df[column_name].astype(str).str.strip() == '')
1512
- #| (df[column_name] == "nan")
1513
- | (df[column_name].astype(str).str.len() != length)
1514
- )
1515
- except Exception as e:
1516
- # Fallback if conversion to string fails (e.g., column contains complex objects)
1517
- print(f"Warning: Could not perform reliable empty string check on column '{column_name}' due to data type issues. Checking for NaN/None only. Error: {e}")
1518
- is_missing_or_empty = df[column_name].isna()
1519
 
1520
  rows_to_fill_index = df.index[is_missing_or_empty]
1521
  num_needed = len(rows_to_fill_index)
1522
 
1523
  if num_needed == 0:
1524
- #print(f"No missing or empty values found in column '{column_name}'.")
 
 
 
 
 
 
1525
  return df
1526
 
1527
  print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")
1528
 
1529
  # --- Get Existing IDs to Ensure Uniqueness ---
1530
- try:
1531
- # Get all non-missing, non-empty string values from the column
1532
- existing_ids = set(df.loc[~is_missing_or_empty, column_name].astype(str))
1533
- except Exception as e:
1534
- print(f"Warning: Could not reliably get all existing string IDs from column '{column_name}' due to data type issues. Uniqueness check might be less strict. Error: {e}")
1535
- # Fallback: Get only non-NaN IDs, potential type issues ignored
1536
- existing_ids = set(df.loc[df[column_name].notna(), column_name])
 
 
 
 
 
1537
 
1538
 
1539
  # --- Generate Unique IDs ---
@@ -1543,93 +1726,230 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
1543
 
1544
  max_possible_ids = len(character_set) ** length
1545
  if num_needed > max_possible_ids:
1546
- raise ValueError(f"Cannot generate {num_needed} unique IDs with length {length}. Maximum possible is {max_possible_ids}.")
1547
- # Add a check for practical limits if needed, e.g., if num_needed is very close to max_possible_ids, generation could be slow.
 
 
1548
 
1549
  #print(f"Generating {num_needed} unique IDs of length {length}...")
1550
  for i in range(num_needed):
1551
  attempts = 0
1552
  while True:
1553
  candidate_id = ''.join(random.choices(character_set, k=length))
1554
- # Check against *all* existing IDs and *newly* generated ones
1555
  if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
1556
  generated_ids_set.add(candidate_id)
1557
  new_ids_list.append(candidate_id)
1558
  break # Found a unique ID
1559
  attempts += 1
1560
- if attempts > num_needed * 100 and attempts > 1000 : # Safety break for unlikely infinite loop
1561
- raise RuntimeError(f"Failed to generate a unique ID after {attempts} attempts. Check length and character set or existing IDs.")
1562
 
1563
- # Optional progress update for large numbers
1564
- if (i + 1) % 1000 == 0:
1565
- print(f"Generated {i+1}/{num_needed} IDs...")
1566
 
1567
 
1568
  # --- Assign New IDs ---
1569
  # Use the previously identified index to assign the new IDs correctly
 
 
 
 
1570
  df.loc[rows_to_fill_index, column_name] = new_ids_list
1571
- #print(f"Successfully filled {len(new_ids_list)} missing values in column '{column_name}'.")
 
 
 
1572
 
1573
- # The DataFrame 'df' has been modified in place
1574
  return df
1575
 
1576
- def convert_review_df_to_annotation_json(review_file_df:pd.DataFrame,
1577
- image_paths:List[Image.Image],
1578
- page_sizes:List[dict]=[]) -> List[dict]:
1579
- '''
1580
- Convert a review csv to a json file for use by the Gradio Annotation object.
1581
- '''
1582
- # Make sure all relevant cols are float
1583
- float_cols = ["page", "xmin", "xmax", "ymin", "ymax"]
1584
- for col in float_cols:
1585
- review_file_df.loc[:, col] = pd.to_numeric(review_file_df.loc[:, col], errors='coerce')
1586
-
1587
- # Convert relative co-ordinates into image coordinates for the image annotation output object
1588
- if page_sizes:
1589
- page_sizes_df = pd.DataFrame(page_sizes)
1590
- page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
1591
 
1592
- review_file_df = multiply_coordinates_by_page_sizes(review_file_df, page_sizes_df)
1593
-
1594
- review_file_df = fill_missing_ids(review_file_df)
1595
 
1596
- if 'id' not in review_file_df.columns:
1597
- review_file_df['id'] = ''
1598
- review_file_df['id'] = review_file_df['id'].astype(str)
1599
-
1600
- # Keep only necessary columns
1601
- review_file_df = review_file_df[["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "id", "text"]].drop_duplicates(subset=["image", "page", "xmin", "ymin", "xmax", "ymax", "label", "id"])
1602
-
1603
- # If colours are saved as list, convert to tuple
1604
- review_file_df.loc[:, "color"] = review_file_df.loc[:,"color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
1605
 
1606
- # Group the DataFrame by the 'image' column
1607
- grouped_csv_pages = review_file_df.groupby('page')
 
 
 
1608
 
1609
- # Create a list to hold the JSON data
1610
- json_data = []
1611
 
1612
- for page_no, pdf_image_path in enumerate(page_sizes_df["image_path"]):
1613
-
1614
- reported_page_number = int(page_no + 1)
1615
 
1616
- if reported_page_number in review_file_df["page"].values:
1617
 
1618
- # Convert each relevant group to a list of box dictionaries
1619
- selected_csv_pages = grouped_csv_pages.get_group(reported_page_number)
1620
- annotation_boxes = selected_csv_pages.drop(columns=['image', 'page']).to_dict(orient='records')
1621
-
1622
- annotation = {
1623
- "image": pdf_image_path,
1624
- "boxes": annotation_boxes
1625
- }
1626
1627
  else:
1628
- annotation = {}
1629
- annotation["image"] = pdf_image_path
1630
- annotation["boxes"] = []
1631
 
1632
- # Append the structured data to the json_data list
1633
- json_data.append(annotation)
 
1634
 
1635
- return json_data
21
  from scipy.spatial import cKDTree
22
  import random
23
  import string
24
+ import warnings # To warn about potential type changes
25
 
26
  IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
27
 
 
618
 
619
  elif file_extension in ['.csv']:
620
  if '_review_file' in file_path_without_ext:
 
621
  review_file_csv = read_file(file_path)
622
  all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
623
  json_from_csv = True
624
+ #print("Converted CSV review file to image annotation object")
625
  elif '_ocr_output' in file_path_without_ext:
626
  all_line_level_ocr_results_df = read_file(file_path)
627
  json_from_csv = False
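
The branch above routes uploaded CSVs by filename suffix: `_review_file` CSVs are converted into the image annotation object, while `_ocr_output` CSVs are loaded as line-level OCR results. A self-contained sketch of that convention (the helper name is hypothetical):

```python
from pathlib import Path

def classify_csv(file_path: str) -> str:
    """Hypothetical helper mirroring the suffix routing in prepare_image_or_pdf."""
    stem = Path(file_path).stem  # file name without extension
    if '_review_file' in stem:
        return 'review_file'   # becomes an image annotation object
    if '_ocr_output' in stem:
        return 'ocr_output'    # becomes a line-level OCR DataFrame
    return 'unknown'

assert classify_csv('doc_review_file.csv') == 'review_file'
assert classify_csv('doc_ocr_output.csv') == 'ocr_output'
```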
 
850
 
851
  return result
852
 
853
+ def divide_coordinates_by_page_sizes(
854
+ review_file_df: pd.DataFrame,
855
+ page_sizes_df: pd.DataFrame,
856
+ xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
857
+ ) -> pd.DataFrame:
858
+ """
859
+ Optimized function to convert absolute image coordinates (>1) to relative coordinates (<=1).
860
 
861
+ Identifies rows with absolute coordinates, merges page size information,
862
+ divides coordinates by dimensions, and combines with already-relative rows.
863
 
864
+ Args:
865
+ review_file_df: Input DataFrame with potentially mixed coordinate systems.
866
+ page_sizes_df: DataFrame with page dimensions ('page', 'image_width',
867
+ 'image_height', 'mediabox_width', 'mediabox_height').
868
+ xmin, xmax, ymin, ymax: Names of the coordinate columns.
869
 
870
+ Returns:
871
+ DataFrame with coordinates converted to relative system, sorted.
872
+ """
873
+ if review_file_df.empty or xmin not in review_file_df.columns:
874
+ return review_file_df # Return early if empty or key column missing
875
 
876
+ # --- Initial Type Conversion ---
877
+ coord_cols = [xmin, xmax, ymin, ymax]
878
+ cols_to_convert = coord_cols + ["page"]
879
+ temp_df = review_file_df.copy() # Work on a copy initially
880
 
881
+ for col in cols_to_convert:
882
+ if col in temp_df.columns:
883
+ temp_df[col] = pd.to_numeric(temp_df[col], errors="coerce")
884
+ else:
885
+ # If essential 'page' or coord column missing, cannot proceed meaningfully
886
+ if col == 'page' or col in coord_cols:
887
+ print(f"Warning: Required column '{col}' not found in review_file_df. Returning original DataFrame.")
888
+ return review_file_df
889
+
890
+ # --- Identify Absolute Coordinates ---
891
+ # Create mask for rows where *all* coordinates are potentially absolute (> 1)
892
+ # Handle potential NaNs introduced by to_numeric - treat NaN as not absolute.
893
+ is_absolute_mask = (
894
+ (temp_df[xmin] > 1) & (temp_df[xmin].notna()) &
895
+ (temp_df[xmax] > 1) & (temp_df[xmax].notna()) &
896
+ (temp_df[ymin] > 1) & (temp_df[ymin].notna()) &
897
+ (temp_df[ymax] > 1) & (temp_df[ymax].notna())
898
+ )
899
 
900
+ # --- Separate DataFrames ---
901
+ df_rel = temp_df[~is_absolute_mask] # Rows already relative or with NaN/mixed coords
902
+ df_abs = temp_df[is_absolute_mask].copy() # Absolute rows - COPY here to allow modifications
903
+
904
+ # --- Process Absolute Coordinates ---
905
+ if not df_abs.empty:
906
+ # Merge page sizes if necessary
907
+ if "image_width" not in df_abs.columns and not page_sizes_df.empty:
908
+ ps_df_copy = page_sizes_df.copy() # Work on a copy of page sizes
909
+
910
+ # Ensure page is numeric for merge key matching
911
+ ps_df_copy['page'] = pd.to_numeric(ps_df_copy['page'], errors='coerce')
912
+
913
+ # Columns to merge from page_sizes
914
+ merge_cols = ['page', 'image_width', 'image_height', 'mediabox_width', 'mediabox_height']
915
+ available_merge_cols = [col for col in merge_cols if col in ps_df_copy.columns]
916
+
917
+ # Prepare dimension columns in the copy
918
+ for col in ['image_width', 'image_height', 'mediabox_width', 'mediabox_height']:
919
+ if col in ps_df_copy.columns:
920
+ # Replace "<NA>" string if present
921
+ if ps_df_copy[col].dtype == 'object':
922
+ ps_df_copy[col] = ps_df_copy[col].replace("<NA>", pd.NA)
923
+ # Convert to numeric
924
+ ps_df_copy[col] = pd.to_numeric(ps_df_copy[col], errors='coerce')
925
+
926
+ # Perform the merge
927
+ if 'page' in available_merge_cols: # Check if page exists for merging
928
+ df_abs = df_abs.merge(
929
+ ps_df_copy[available_merge_cols],
930
+ on="page",
931
+ how="left"
932
+ )
933
+ else:
934
+ print("Warning: 'page' column not found in page_sizes_df. Cannot merge dimensions.")
935
+
936
+
937
+ # Fallback to mediabox dimensions if image dimensions are missing
938
+ if "image_width" in df_abs.columns and "mediabox_width" in df_abs.columns:
939
+ # Check if image_width mostly missing - use .isna().all() or check percentage
940
+ if df_abs["image_width"].isna().all():
941
+ print("Falling back to mediabox dimensions as image_width is entirely missing.")
942
+ df_abs["image_width"] = df_abs["image_width"].fillna(df_abs["mediabox_width"])
943
+ df_abs["image_height"] = df_abs["image_height"].fillna(df_abs["mediabox_height"])
944
+ else:
945
+ # Optional: Fill only missing image dims if some exist?
946
+ # df_abs["image_width"].fillna(df_abs["mediabox_width"], inplace=True)
947
+ # df_abs["image_height"].fillna(df_abs["mediabox_height"], inplace=True)
948
+ pass # Current logic only falls back if ALL image_width are NaN
949
+
950
+ # Ensure divisor columns are numeric before division
951
+ divisors_numeric = True
952
+ for col in ["image_width", "image_height"]:
953
+ if col in df_abs.columns:
954
+ df_abs[col] = pd.to_numeric(df_abs[col], errors='coerce')
955
+ else:
956
+ print(f"Warning: Dimension column '{col}' missing. Cannot perform division.")
957
+ divisors_numeric = False
958
+
959
+
960
+ # Perform division if dimensions are available and numeric
961
+ if divisors_numeric and "image_width" in df_abs.columns and "image_height" in df_abs.columns:
962
+ # Use np.errstate to suppress warnings about division by zero or NaN if desired
963
+ with np.errstate(divide='ignore', invalid='ignore'):
964
+ df_abs[xmin] = df_abs[xmin] / df_abs["image_width"]
965
+ df_abs[xmax] = df_abs[xmax] / df_abs["image_width"]
966
+ df_abs[ymin] = df_abs[ymin] / df_abs["image_height"]
967
+ df_abs[ymax] = df_abs[ymax] / df_abs["image_height"]
968
+ # Replace potential infinities with NaN (optional, depending on desired outcome)
969
+ df_abs.replace([np.inf, -np.inf], np.nan, inplace=True)
970
  else:
971
+ print("Skipping coordinate division due to missing or non-numeric dimension columns.")
972
973
 
974
+ # --- Combine Relative and Processed Absolute DataFrames ---
975
+ dfs_to_concat = [df for df in [df_rel, df_abs] if not df.empty]
976
 
977
+ if dfs_to_concat:
978
+ final_df = pd.concat(dfs_to_concat, ignore_index=True)
979
+ else:
980
+ # If both splits were empty, return an empty DF with original columns
981
+ print("Warning: Both relative and absolute splits resulted in empty DataFrames.")
982
+ final_df = pd.DataFrame(columns=review_file_df.columns)
983
 
 
984
 
985
+ # --- Final Sort ---
986
+ required_sort_columns = {"page", xmin, ymin}
987
+ if not final_df.empty and required_sort_columns.issubset(final_df.columns):
988
+ # Ensure sort columns are numeric before sorting
989
+ final_df['page'] = pd.to_numeric(final_df['page'], errors='coerce')
990
+ final_df[ymin] = pd.to_numeric(final_df[ymin], errors='coerce')
991
+ final_df[xmin] = pd.to_numeric(final_df[xmin], errors='coerce')
992
+ # Sort by page, ymin, xmin (note order compared to multiply function)
993
+ final_df.sort_values(["page", ymin, xmin], inplace=True, na_position='last')
994
 
 
995
 
996
+ # --- Clean Up Columns ---
997
+ # Correctly drop columns and reassign the result
998
+ cols_to_drop = ["image_width", "image_height", "mediabox_width", "mediabox_height"]
999
+ final_df = final_df.drop(columns=cols_to_drop, errors="ignore")
1000
 
1001
+ return final_df
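
A tiny usage sketch for the refactored function, assuming it is imported from `tools.file_conversion` (values are illustrative):

```python
import pandas as pd
from tools.file_conversion import divide_coordinates_by_page_sizes

review = pd.DataFrame({
    "page": [1], "xmin": [120.0], "xmax": [300.0], "ymin": [50.0], "ymax": [80.0],
})
sizes = pd.DataFrame({
    "page": [1], "image_width": [600], "image_height": [800],
    "mediabox_width": [595], "mediabox_height": [842],
})

rel = divide_coordinates_by_page_sizes(review, sizes)
# All four coordinates are now relative, e.g. xmin == 120 / 600 == 0.2,
# and the merged dimension columns have been dropped again.
```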
 
 
 
1002
 
1003
+ def multiply_coordinates_by_page_sizes(
1004
+ review_file_df: pd.DataFrame,
1005
+ page_sizes_df: pd.DataFrame,
1006
+ xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
1007
+ ):
1008
+ """
1009
+ Optimized function to convert relative coordinates to absolute based on page sizes.
1010
 
1011
+ Separates relative (<=1) and absolute (>1) coordinates, merges page sizes
1012
+ for relative coordinates, calculates absolute pixel values, and recombines.
1013
+ """
1014
+ if review_file_df.empty or xmin not in review_file_df.columns:
1015
+ return review_file_df # Return early if empty or key column missing
1016
+
1017
+ coord_cols = [xmin, xmax, ymin, ymax]
1018
+ # Initial type conversion for coordinates and page
1019
+ for col in coord_cols + ["page"]:
1020
+ if col in review_file_df.columns:
1021
+ # Use astype for potentially faster conversion if confident,
1022
+ # but to_numeric is safer for mixed types/errors
1023
+ review_file_df[col] = pd.to_numeric(review_file_df[col], errors="coerce")
1024
+
1025
+ # --- Identify relative coordinates ---
1026
+ # Create mask for rows where *all* coordinates are potentially relative (<= 1)
1027
+ # Handle potential NaNs introduced by to_numeric - treat NaN as not relative here.
1028
+ is_relative_mask = (
1029
+ (review_file_df[xmin].le(1) & review_file_df[xmin].notna()) &
1030
+ (review_file_df[xmax].le(1) & review_file_df[xmax].notna()) &
1031
+ (review_file_df[ymin].le(1) & review_file_df[ymin].notna()) &
1032
+ (review_file_df[ymax].le(1) & review_file_df[ymax].notna())
1033
+ )
1034
 
1035
+ # Separate DataFrames (minimal copies)
1036
+ df_abs = review_file_df[~is_relative_mask].copy() # Keep absolute rows separately
1037
+ df_rel = review_file_df[is_relative_mask].copy() # Work only with relative rows
1038
+
1039
+ if df_rel.empty:
1040
+ # If no relative coordinates, just sort and return absolute ones (if any)
1041
+ if not df_abs.empty and {"page", xmin, ymin}.issubset(df_abs.columns):
1042
+ df_abs.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
1043
+ return df_abs
1044
+
1045
+ # --- Process relative coordinates ---
1046
+ if "image_width" not in df_rel.columns and not page_sizes_df.empty:
1047
+ # Prepare page_sizes_df for merge
1048
+ page_sizes_df = page_sizes_df.copy() # Avoid modifying original page_sizes_df
1049
+ page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
1050
+ # Ensure proper NA handling for image dimensions
1051
+ page_sizes_df[['image_width', 'image_height']] = page_sizes_df[['image_width','image_height']].replace("<NA>", pd.NA)
1052
+ page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
1053
+ page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
1054
+
1055
+ # Merge page sizes
1056
+ df_rel = df_rel.merge(
1057
+ page_sizes_df[['page', 'image_width', 'image_height']],
1058
+ on="page",
1059
+ how="left"
1060
+ )
1061
 
1062
+ # Multiply coordinates where image dimensions are available
1063
+ if "image_width" in df_rel.columns:
1064
+ # Create mask for rows in df_rel that have valid image dimensions
1065
+ has_size_mask = df_rel["image_width"].notna() & df_rel["image_height"].notna()
1066
 
1067
+ # Apply multiplication using .loc and the mask (vectorized and efficient)
1068
+ # Ensure columns are numeric before multiplication (might be redundant if types are good)
1069
+ # df_rel.loc[has_size_mask, coord_cols + ['image_width', 'image_height']] = df_rel.loc[has_size_mask, coord_cols + ['image_width', 'image_height']].apply(pd.to_numeric, errors='coerce')
 
1070
 
1071
+ df_rel.loc[has_size_mask, xmin] *= df_rel.loc[has_size_mask, "image_width"]
1072
+ df_rel.loc[has_size_mask, xmax] *= df_rel.loc[has_size_mask, "image_width"]
1073
+ df_rel.loc[has_size_mask, ymin] *= df_rel.loc[has_size_mask, "image_height"]
1074
+ df_rel.loc[has_size_mask, ymax] *= df_rel.loc[has_size_mask, "image_height"]
1075
1076
 
1077
+ # --- Combine absolute and processed relative DataFrames ---
1078
+ # Use list comprehension to handle potentially empty DataFrames
1079
+ dfs_to_concat = [df for df in [df_abs, df_rel] if not df.empty]
1080
 
1081
+ if not dfs_to_concat:
1082
+ return pd.DataFrame() # Return empty if both are empty
 
 
 
 
1083
 
1084
+ final_df = pd.concat(dfs_to_concat, ignore_index=True) # ignore_index is good practice after filtering/concat
 
 
 
1085
 
1086
+ # --- Final Sort ---
1087
+ required_sort_columns = {"page", xmin, ymin}
1088
+ if not final_df.empty and required_sort_columns.issubset(final_df.columns):
1089
+ # Handle potential NaNs in sort columns gracefully
1090
+ final_df.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
1091
 
1092
+ return final_df
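
And the inverse direction, again with illustrative values; note that, as written above, this function keeps the merged `image_width`/`image_height` columns on the returned rows:

```python
import pandas as pd
from tools.file_conversion import multiply_coordinates_by_page_sizes

rel = pd.DataFrame({
    "page": [1], "xmin": [0.2], "xmax": [0.5], "ymin": [0.0625], "ymax": [0.1],
})
sizes = pd.DataFrame({"page": [1], "image_width": [600], "image_height": [800]})

absolute = multiply_coordinates_by_page_sizes(rel, sizes)
# xmin == 0.2 * 600 == 120.0, ymin == 0.0625 * 800 == 50.0
```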
1093
 
1094
  def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
1095
  '''
 
1143
 
1144
  return merged_df
1145
 
 
1146
  def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
1147
  '''
1148
  Match text from one dataframe to another based on proximity matching of coordinates across all pages.
 
1266
  # prevents this from being necessary.
1267
 
1268
  # 7. Ensure essential columns exist and set column order
1269
+ essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id", "label"]
1270
  for col in essential_box_cols:
1271
  if col not in final_df.columns:
1272
  final_df[col] = pd.NA # Add column with NA if it wasn't present in any box
1273
 
1274
+ base_cols = ["image"]
1275
  extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
1276
  final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)
1277
 
 
1309
  available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]
1310
 
1311
  if 'text' in all_image_annotations_df.columns:
1312
+ all_image_annotations_df['text'] = all_image_annotations_df['text'].fillna('')
1313
+ #all_image_annotations_df.loc[all_image_annotations_df['text'].isnull(), 'text'] = ''
1314
 
1315
  if not available_cols:
1316
  print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")
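
The `fillna('')` one-liner is the vectorized equivalent of the `.loc` mask assignment it replaces:

```python
import pandas as pd

s = pd.Series(["a", None, "b"])
assert s.fillna('').tolist() == ["a", "", "b"]  # same result as s.loc[s.isnull()] = ''
```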
 
1351
 
1352
  return result
1353
 
1354
+ def convert_annotation_json_to_review_df(
1355
+ all_annotations: List[dict],
1356
+ redaction_decision_output: pd.DataFrame = pd.DataFrame(),
1357
+ page_sizes: List[dict] = [],
1358
+ do_proximity_match: bool = True
1359
+ ) -> pd.DataFrame:
1360
  '''
1361
  Convert the annotation json data to a dataframe format.
1362
  Add on any text from the initial review_file dataframe by joining based on 'id' if available
1363
  in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
1364
+
1365
+ Refactored for improved efficiency, prioritizing ID-based join and conditionally applying
1366
+ coordinate division and proximity matching.
1367
  '''
1368
 
1369
  # 1. Convert annotations to DataFrame
 
 
 
1370
  review_file_df = convert_annotation_data_to_dataframe(all_annotations)
1371
 
1372
+ # Only keep rows in review_df where there are coordinates (assuming xmin is representative)
1373
+ # Use .notna() for robustness with potential None or NaN values
1374
+ review_file_df.dropna(subset=['xmin', 'ymin', 'xmax', 'ymax'], how='any', inplace=True)
1375
 
1376
  # Exit early if the initial conversion results in an empty DataFrame
1377
  if review_file_df.empty:
1378
  # Define standard columns for an empty return DataFrame
1379
+ # Ensure 'id' is included if it was potentially expected based on input structure
1380
+ # We don't know the columns from convert_annotation_data_to_dataframe without seeing it,
1381
+ # but let's assume a standard set and add 'id' if it appeared.
1382
+ standard_cols = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text"]
1383
+ if 'id' in review_file_df.columns:
1384
+ standard_cols.append('id')
1385
+ return pd.DataFrame(columns=standard_cols)
1386
+
1387
+ # Ensure 'id' column exists for logic flow, even if empty
1388
+ if 'id' not in review_file_df.columns:
1389
+ review_file_df['id'] = ''
1390
+ # Do the same for redaction_decision_output if it's not empty
1391
+ if not redaction_decision_output.empty and 'id' not in redaction_decision_output.columns:
1392
+ redaction_decision_output['id'] = ''
1393
 
1394
 
1395
+ # 2. Process page sizes if provided - needed potentially for coordinate division later
1396
+ # Process this once upfront if the data is available
1397
+ page_sizes_df = pd.DataFrame() # Initialize as empty
1398
+ if page_sizes:
1399
+ page_sizes_df = pd.DataFrame(page_sizes)
1400
  if not page_sizes_df.empty:
1401
+ # Safely convert page column to numeric and then int
1402
+ page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
1403
+ page_sizes_df.dropna(subset=["page"], inplace=True)
1404
+ if not page_sizes_df.empty: # Check again after dropping NaNs
1405
+ page_sizes_df["page"] = page_sizes_df["page"].astype(int)
1406
+ else:
1407
+ print("Warning: Page sizes DataFrame became empty after processing, coordinate division will be skipped.")
1408
 
1409
 
1410
  # 3. Join additional data from redaction_decision_output if provided
1411
+ text_added_successfully = False # Flag to track if text was added by any method
1412
+
1413
  if not redaction_decision_output.empty:
1414
+ # --- Attempt to join data based on 'id' column first ---
1415
+
1416
+ # Check if 'id' columns are present and have non-null values in *both* dataframes
1417
+ id_col_exists_in_review = 'id' in review_file_df.columns and not review_file_df['id'].isnull().all() and not (review_file_df['id'] == '').all()
1418
+ id_col_exists_in_redaction = 'id' in redaction_decision_output.columns and not redaction_decision_output['id'].isnull().all() and not (redaction_decision_output['id'] == '').all()
1419
+
1420
 
1421
  if id_col_exists_in_review and id_col_exists_in_redaction:
1422
  #print("Attempting to join data based on 'id' column.")
1423
  try:
1424
+ # Ensure 'id' columns are of string type for robust merging
1425
  review_file_df['id'] = review_file_df['id'].astype(str)
1426
+ # Make a copy if needed, but try to avoid if redaction_decision_output isn't modified later
1427
+ # Let's use a copy for safety as in the original code
1428
  redaction_copy = redaction_decision_output.copy()
1429
  redaction_copy['id'] = redaction_copy['id'].astype(str)
1430
 
1431
+ # Select columns to merge from redaction output. Prioritize 'text'.
 
 
1432
  cols_to_merge = ['id']
1433
  if 'text' in redaction_copy.columns:
1434
  cols_to_merge.append('text')
 
1436
  print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")
1437
 
1438
  # Perform a left merge to keep all annotations and add matching text
1439
+ # Use a suffix for the text column from the right DataFrame
1440
+ original_text_col_exists = 'text' in review_file_df.columns
1441
+ merge_suffix = '_redaction' if original_text_col_exists else ''
1442
+
1443
  merged_df = pd.merge(
1444
  review_file_df,
1445
  redaction_copy[cols_to_merge],
1446
  on='id',
1447
  how='left',
1448
+ suffixes=('', merge_suffix)
1449
  )
1450
 
1451
+ # Update the 'text' column if a new one was brought in
1452
+ if 'text' + merge_suffix in merged_df.columns:
1453
+ redaction_text_col = 'text' + merge_suffix
1454
+ if original_text_col_exists:
1455
+ # Combine: Use text from redaction where available, otherwise keep original
1456
+ merged_df['text'] = merged_df[redaction_text_col].combine_first(merged_df['text'])
1457
+ # Drop the temporary column
1458
+ merged_df = merged_df.drop(columns=[redaction_text_col])
1459
+ else:
1460
+ # Redaction output had text, but review_file_df didn't. Rename the new column.
1461
+ merged_df = merged_df.rename(columns={redaction_text_col: 'text'})
1462
 
1463
+ text_added_successfully = True # Indicate text was potentially added
 
 
 
 
 
1464
 
1465
+ review_file_df = merged_df # Update the main DataFrame
1466
 
1467
+ #print("Successfully attempted to join data using 'id'.") # Note: Text might not have been in redaction data
 
1468
 
1469
  except Exception as e:
1470
+ print(f"Error during 'id'-based merge: {e}. Checking for proximity match fallback.")
1471
+ # Fall through to proximity match logic below
1472
+
1473
+ # --- Fallback to proximity match if ID join wasn't possible/successful and enabled ---
1474
+ # Note: If id_col_exists_in_review or id_col_exists_in_redaction was False,
1475
+ # the block above was skipped, and we naturally fall here.
1476
+ # If an error occurred in the try block, the fallback still applies
1477
+ # because text_added_successfully was never set to True.
1478
+
1479
+ # Only attempt proximity match if text wasn't added by ID join and proximity is requested
1480
+ if not text_added_successfully and do_proximity_match:
1481
+ print("Attempting proximity match to add text data.")
1482
+
1483
+ # Ensure 'page' columns are numeric before coordinate division and proximity match
1484
+ # (Assuming divide_coordinates_by_page_sizes and do_proximity_match_all_pages_for_text need this)
1485
+ if 'page' in review_file_df.columns:
1486
+ review_file_df['page'] = pd.to_numeric(review_file_df['page'], errors='coerce').fillna(-1).astype(int) # Use -1 for NaN pages
1487
+ review_file_df = review_file_df[review_file_df['page'] != -1] # Drop rows where page conversion failed
1488
+ if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
1489
+ redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce').fillna(-1).astype(int)
1490
+ redaction_decision_output = redaction_decision_output[redaction_decision_output['page'] != -1]
1491
+
1492
+ # Perform coordinate division IF page_sizes were processed and DataFrame is not empty
1493
+ if not page_sizes_df.empty:
1494
+ # Apply coordinate division *before* proximity match
1495
+ review_file_df = divide_coordinates_by_page_sizes(review_file_df, page_sizes_df)
1496
+ if not redaction_decision_output.empty:
1497
+ redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
1498
+
1499
+ # Now perform the proximity match
1500
+ # Note: Potential DataFrame copies happen inside do_proximity_match based on its implementation
1501
+ if not redaction_decision_output.empty:
1502
+ try:
1503
+ review_file_df = do_proximity_match_all_pages_for_text(
1504
+ df1=review_file_df, # Pass directly, avoid caller copy if possible by modifying function signature
1505
+ df2=redaction_decision_output # Pass directly
1506
+ )
1507
+ # Assuming do_proximity_match_all_pages_for_text adds/updates the 'text' column
1508
+ if 'text' in review_file_df.columns:
1509
+ text_added_successfully = True
1510
+ print("Proximity match completed.")
1511
+ except Exception as e:
1512
+ print(f"Error during proximity match: {e}. Text data may not be added.")
1513
+
1514
+ elif not text_added_successfully and not do_proximity_match:
1515
+ print("Skipping joining text data (ID join not possible/failed, proximity match disabled).")
1516
+
1517
+ # 4. Ensure required columns exist and are ordered
1518
+ # Define base required columns. 'id' and 'text' are conditionally added.
1519
+ required_columns_base = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax"]
1520
+ final_columns = required_columns_base[:] # Start with base columns
1521
+
1522
+ # Add 'id' and 'text' if they exist in the DataFrame at this point
1523
  if 'id' in review_file_df.columns:
1524
+ final_columns.append('id')
1525
+ if 'text' in review_file_df.columns:
1526
+ final_columns.append('text') # Add text column if it was created/merged
1527
 
1528
+ # Add any missing required columns with a default value (e.g., blank string)
1529
+ for col in final_columns:
1530
  if col not in review_file_df.columns:
1531
+ # Use appropriate default based on expected type, '' for text/id, np.nan for coords?
1532
+ # Sticking to '' as in original for simplicity, but consider data types.
1533
+ review_file_df[col] = '' # Or np.nan for numerical, but coords already checked by dropna
1534
 
1535
  # Select and order the final set of columns
1536
+ # Ensure all selected columns actually exist after adding defaults
1537
+ review_file_df = review_file_df[[col for col in final_columns if col in review_file_df.columns]]
1538
+
1539
 
1540
  # 5. Final processing and sorting
1541
+ # Convert colours from list to tuple if necessary - apply is okay here unless lists are vast
1542
  if 'color' in review_file_df.columns:
1543
+ # Check if the column actually contains lists before applying lambda
1544
+ if review_file_df['color'].apply(lambda x: isinstance(x, list)).any():
1545
+ review_file_df["color"] = review_file_df["color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
1546
 
1547
  # Sort the results
 
1548
  # Ensure sort columns exist before sorting
1549
+ sort_columns = ['page', 'ymin', 'xmin', 'label']
1550
  valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
1551
+ if valid_sort_columns and not review_file_df.empty: # Only sort non-empty df
1552
+ # Convert potential numeric sort columns to appropriate types if necessary
1553
+ # (e.g., 'page', 'ymin', 'xmin') to ensure correct sorting.
1554
+ # dropna(subset=[...], inplace=True) earlier should handle NaNs in coords.
1555
+ # page conversion already done before proximity match.
1556
+ try:
1557
+ review_file_df = review_file_df.sort_values(valid_sort_columns)
1558
+ except TypeError as e:
1559
+ print(f"Warning: Could not sort DataFrame due to type error in sort columns: {e}")
1560
+ # Proceed without sorting
1561
  return review_file_df
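
Aside: the merge logic in this hunk follows a two-stage pattern: join text onto the review rows via the shared `id` column where possible, and fall back to the slower coordinate-proximity match only when the ID join is unavailable or fails. A minimal sketch of the pattern (the `proximity_match` helper below is a hypothetical stand-in for `do_proximity_match_all_pages_for_text`):

```python
import pandas as pd

def proximity_match(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: the real helper pairs boxes on the same page whose
    # coordinates are closest, then copies 'text' across.
    return df1

def attach_text(review_df: pd.DataFrame, decisions_df: pd.DataFrame,
                do_proximity_match: bool = True) -> pd.DataFrame:
    """Attach 'text' from decisions_df to review_df, preferring an ID join."""
    text_added = False

    # Stage 1: fast, exact join on a shared 'id' column
    if "id" in review_df.columns and "id" in decisions_df.columns:
        try:
            review_df = review_df.merge(
                decisions_df[["id", "text"]], on="id", how="left")
            text_added = review_df["text"].notna().any()
        except Exception as e:
            print(f"ID join failed: {e}. Falling back to proximity match.")

    # Stage 2: slower coordinate-proximity fallback, only when needed
    if not text_added and do_proximity_match:
        review_df = proximity_match(review_df, decisions_df)

    return review_df
```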
1562
 
1563
  def fill_missing_box_ids(data_input: dict) -> dict:
 
1641
 
1642
  def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
1643
  """
1644
+ Optimized: Generates unique alphanumeric IDs for rows in a DataFrame column
1645
+ where the value is missing (NaN, None) or an empty/whitespace string.
1646
 
1647
  Args:
1648
  df (pd.DataFrame): The input Pandas DataFrame.
1649
  column_name (str): The name of the column to check and fill (defaults to 'id').
1650
  This column will be added if it doesn't exist.
1651
  length (int): The desired length of the generated IDs (defaults to 12).
 
 
1652
 
1653
  Returns:
1654
  pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
1655
+ Note: The function modifies the DataFrame directly (in-place).
1656
  """
1657
 
1658
  # --- Input Validation ---
 
1664
  raise ValueError("'length' must be a positive integer.")
1665
 
1666
  # --- Ensure Column Exists ---
1667
+ original_dtype = None
1668
  if column_name not in df.columns:
1669
  print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
1670
+ # Initialize with None (which Pandas often treats as NaN but allows object dtype)
1671
+ df[column_name] = None
1672
+ # Set original_dtype to object so it likely becomes string later
1673
+ original_dtype = object
1674
+ else:
1675
+ original_dtype = df[column_name].dtype
1676
 
1677
  # --- Identify Rows Needing IDs ---
1678
+ # 1. Check for actual null values (NaN, None, NaT)
1679
+ is_null = df[column_name].isna()
1680
+
1681
+ # 2. Check for empty or whitespace-only strings AFTER converting potential values to string
1682
+ # Only apply string checks on rows that are *not* null to avoid errors/warnings
1683
+ # Fill NaN temporarily for string operations, then check length or equality
1684
+ is_empty_str = pd.Series(False, index=df.index) # Default to False
1685
+ if not is_null.all(): # Only check strings if there are non-null values
1686
+ temp_str_col = df.loc[~is_null, column_name].astype(str).str.strip()
1687
+ is_empty_str.loc[~is_null] = (temp_str_col == '')
1688
+
1689
+ # Combine the conditions
1690
+ is_missing_or_empty = is_null | is_empty_str
 
1691
 
1692
  rows_to_fill_index = df.index[is_missing_or_empty]
1693
  num_needed = len(rows_to_fill_index)
1694
 
1695
  if num_needed == 0:
1696
+ # Ensure final column type is consistent if nothing was done
1697
+ if pd.api.types.is_object_dtype(original_dtype) or pd.api.types.is_string_dtype(original_dtype):
1698
+ pass # Likely already object or string
1699
+ else:
1700
+ # If original was numeric/etc., but might contain strings now? Unlikely here.
1701
+ pass # Or convert to object: df[column_name] = df[column_name].astype(object)
1702
+ # print(f"No missing or empty values found requiring IDs in column '{column_name}'.")
1703
  return df
1704
 
1705
  print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")
1706
 
1707
  # --- Get Existing IDs to Ensure Uniqueness ---
1708
+ # Consider only rows that are *not* missing/empty
1709
+ valid_rows = df.loc[~is_missing_or_empty, column_name]
1710
+ # Drop any remaining nulls (shouldn't be any based on mask, but belts and braces)
1711
+ valid_rows = valid_rows.dropna()
1712
+ # Convert to string *only* if not already string/object, then filter out empty strings again
1713
+ if not pd.api.types.is_object_dtype(valid_rows.dtype) and not pd.api.types.is_string_dtype(valid_rows.dtype):
1714
+ existing_ids = set(valid_rows.astype(str).str.strip())
1715
+ else: # Already string or object, just strip and convert to set
1716
+ existing_ids = set(valid_rows.astype(str).str.strip()) # astype(str) handles mixed types in object column
1717
+
1718
+ # Remove empty string from existing IDs if it's there after stripping
1719
+ existing_ids.discard('')
1720
 
1721
 
1722
  # --- Generate Unique IDs ---
 
1726
 
1727
  max_possible_ids = len(character_set) ** length
1728
  if num_needed > max_possible_ids:
1729
+ raise ValueError(f"Cannot generate {num_needed} unique IDs with length {length}. Maximum possible is {max_possible_ids}.")
1730
+
1731
+ # Pre-calculate safety break limit
1732
+ max_attempts_per_id = max(1000, num_needed * 10) # Adjust multiplier as needed
1733
 
1734
  #print(f"Generating {num_needed} unique IDs of length {length}...")
1735
  for i in range(num_needed):
1736
  attempts = 0
1737
  while True:
1738
  candidate_id = ''.join(random.choices(character_set, k=length))
1739
+ # Check against *all* known existing IDs and *newly* generated ones
1740
  if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
1741
  generated_ids_set.add(candidate_id)
1742
  new_ids_list.append(candidate_id)
1743
  break # Found a unique ID
1744
  attempts += 1
1745
+ if attempts > max_attempts_per_id : # Safety break
1746
+ raise RuntimeError(f"Failed to generate a unique ID after {attempts} attempts. Check length, character set, or density of existing IDs.")
1747
 
1748
+ # Optional progress update
1749
+ # if (i + 1) % 1000 == 0:
1750
+ # print(f"Generated {i+1}/{num_needed} IDs...")
1751
 
1752
 
1753
  # --- Assign New IDs ---
1754
  # Use the previously identified index to assign the new IDs correctly
1755
+ # Assigning string IDs might change the column's dtype to 'object'
1756
+ if not pd.api.types.is_object_dtype(original_dtype) and not pd.api.types.is_string_dtype(original_dtype):
1757
+ warnings.warn(f"Column '{column_name}' dtype might change from '{original_dtype}' to 'object' due to string ID assignment.", UserWarning)
1758
+
1759
  df.loc[rows_to_fill_index, column_name] = new_ids_list
1760
+ print(f"Successfully assigned {len(new_ids_list)} new unique IDs to column '{column_name}'.")
1761
+
1762
+ # Optional: Convert the entire column to string type at the end for consistency
1763
+ # df[column_name] = df[column_name].astype(str)
1764
 
 
1765
  return df
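
Aside: the ID-generation strategy used by `fill_missing_ids`, random alphanumeric candidates checked against a set of existing values with a safety break, can be illustrated in isolation. A minimal sketch, assuming an a-z/0-9 alphabet since the actual `character_set` is defined outside this hunk:

```python
import random
import string

def generate_unique_ids(existing_ids: set, num_needed: int, length: int = 12) -> list:
    """Generate num_needed IDs of the given length, avoiding existing_ids."""
    charset = string.ascii_lowercase + string.digits  # assumed alphabet
    if num_needed > len(charset) ** length:
        raise ValueError("Cannot generate that many unique IDs at this length.")

    new_ids, seen = [], set(existing_ids)
    for _ in range(num_needed):
        for _attempt in range(10_000):  # safety break, as with max_attempts_per_id
            candidate = ''.join(random.choices(charset, k=length))
            if candidate not in seen:
                seen.add(candidate)
                new_ids.append(candidate)
                break
        else:
            raise RuntimeError("Could not find a free ID; the ID space is too dense.")
    return new_ids

print(generate_unique_ids({"abc123def456"}, num_needed=3))
```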
1766
 
1767
+ def convert_review_df_to_annotation_json(
1768
+ review_file_df: pd.DataFrame,
1769
+ image_paths: List[str], # List of image file paths
1770
+ page_sizes: List[Dict], # List of dicts like [{'page': 1, 'image_path': '...', 'image_width': W, 'image_height': H}, ...]
1771
+ xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax" # Coordinate column names
1772
+ ) -> List[Dict]:
1773
+ """
1774
+ Optimized function to convert review DataFrame to Gradio Annotation JSON format.
 
 
 
 
 
 
 
1775
 
1776
+ Ensures absolute coordinates, handles missing IDs, deduplicates based on key fields,
1777
+ selects final columns, and structures data per image/page based on page_sizes.
 
1778
 
1779
+ Args:
1780
+ review_file_df: Input DataFrame with annotation data.
1781
+ image_paths: List of image file paths (Note: currently unused if page_sizes provides paths).
1782
+ page_sizes: REQUIRED list of dictionaries, each containing 'page',
1783
+ 'image_path', 'image_width', and 'image_height'. Defines
1784
+ output structure and dimensions for coordinate conversion.
1785
+ xmin, xmax, ymin, ymax: Names of the coordinate columns.
 
 
1786
 
1787
+ Returns:
1788
+ List of dictionaries suitable for Gradio Annotation output, one dict per image/page.
1789
+ """
1790
+ if not page_sizes:
1791
+ raise ValueError("page_sizes argument is required and cannot be empty.")
1792
 
1793
+ # --- Prepare Page Sizes DataFrame ---
1794
+ try:
1795
+ page_sizes_df = pd.DataFrame(page_sizes)
1796
+ required_ps_cols = {'page', 'image_path', 'image_width', 'image_height'}
1797
+ if not required_ps_cols.issubset(page_sizes_df.columns):
1798
+ missing = required_ps_cols - set(page_sizes_df.columns)
1799
+ raise ValueError(f"page_sizes is missing required keys: {missing}")
1800
+ # Convert page sizes columns to appropriate numeric types early
1801
+ page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
1802
+ page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
1803
+ page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
1804
+ # Use nullable Int64 for page number consistency
1805
+ page_sizes_df['page'] = page_sizes_df['page'].astype('Int64')
1806
 
1807
+ except Exception as e:
1808
+ raise ValueError(f"Error processing page_sizes: {e}") from e
 
1809
 
 
1810
 
1811
+ # Handle empty input DataFrame gracefully
1812
+ if review_file_df.empty:
1813
+ print("Input review_file_df is empty. Proceeding to generate JSON structure with empty boxes.")
1814
+ # Ensure essential columns exist even if empty for later steps
1815
+ for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
1816
+ if col not in review_file_df.columns:
1817
+ review_file_df[col] = pd.NA
1818
+ else:
1819
+ # --- Coordinate Conversion (if needed) ---
1820
+ coord_cols_to_check = [c for c in [xmin, xmax, ymin, ymax] if c in review_file_df.columns]
1821
+ needs_multiplication = False
1822
+ if coord_cols_to_check:
1823
+ temp_df_numeric = review_file_df[coord_cols_to_check].apply(pd.to_numeric, errors='coerce')
1824
+ if temp_df_numeric.le(1).any().any(): # Check if any numeric coord <= 1 exists
1825
+ needs_multiplication = True
1826
+
1827
+ if needs_multiplication:
1828
+ #print("Relative coordinates detected or suspected, running multiplication...")
1829
+ review_file_df = multiply_coordinates_by_page_sizes(
1830
+ review_file_df.copy(), # Pass a copy to avoid modifying original outside function
1831
+ page_sizes_df,
1832
+ xmin, xmax, ymin, ymax
1833
+ )
1834
+ else:
1835
+ #print("No relative coordinates detected or required columns missing, skipping multiplication.")
1836
+ # Still ensure essential coordinate/page columns are numeric if they exist
1837
+ cols_to_convert = [c for c in [xmin, xmax, ymin, ymax, "page"] if c in review_file_df.columns]
1838
+ for col in cols_to_convert:
1839
+ review_file_df[col] = pd.to_numeric(review_file_df[col], errors='coerce')
1840
 
1841
+ # Handle potential case where multiplication returns an empty DF
1842
+ if review_file_df.empty:
1843
+ print("DataFrame became empty after coordinate processing.")
1844
+ # Re-add essential columns if they were lost
1845
+ for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
1846
+ if col not in review_file_df.columns:
1847
+ review_file_df[col] = pd.NA
1848
+
1849
+ # --- Fill Missing IDs ---
1850
+ review_file_df = fill_missing_ids(review_file_df.copy()) # Pass a copy
1851
+
1852
+ # --- Deduplicate Based on Key Fields ---
1853
+ base_dedupe_cols = ["page", xmin, ymin, xmax, ymax, "label", "id"]
1854
+ # Identify which deduplication columns actually exist in the DataFrame
1855
+ cols_for_dedupe = [col for col in base_dedupe_cols if col in review_file_df.columns]
1856
+ # Add 'image' column for deduplication IF it exists (matches original logic intent)
1857
+ if "image" in review_file_df.columns:
1858
+ cols_for_dedupe.append("image")
1859
+
1860
+ # Ensure placeholder columns exist if they are needed for deduplication
1861
+ # (e.g., 'label', 'id' should be present after fill_missing_ids)
1862
+ for col in ['label', 'id']:
1863
+ if col in cols_for_dedupe and col not in review_file_df.columns:
1864
+ # This might indicate an issue in fill_missing_ids or prior steps
1865
+ print(f"Warning: Column '{col}' needed for dedupe but not found. Adding NA.")
1866
+ review_file_df[col] = "" # Add default empty string
1867
+
1868
+ if cols_for_dedupe: # Only attempt dedupe if we have columns to check
1869
+ #print(f"Deduplicating based on columns: {cols_for_dedupe}")
1870
+ # Convert relevant columns to string before dedupe to avoid type issues with mixed data (optional, depends on data)
1871
+ # for col in cols_for_dedupe:
1872
+ # review_file_df[col] = review_file_df[col].astype(str)
1873
+ review_file_df = review_file_df.drop_duplicates(subset=cols_for_dedupe)
1874
  else:
1875
+ print("Skipping deduplication: No valid columns found to deduplicate by.")
1876
+
1877
+
1878
+ # --- Select and Prepare Final Output Columns ---
1879
+ required_final_cols = ["page", "label", "color", xmin, ymin, xmax, ymax, "id", "text"]
1880
+ # Identify which of the desired final columns exist in the (now potentially deduplicated) DataFrame
1881
+ available_final_cols = [col for col in required_final_cols if col in review_file_df.columns]
1882
+
1883
+ # Ensure essential output columns exist, adding defaults if missing AFTER deduplication
1884
+ for col in required_final_cols:
1885
+ if col not in review_file_df.columns:
1886
+ print(f"Adding missing final column '{col}' with default value.")
1887
+ if col in ['label', 'id', 'text']:
1888
+ review_file_df[col] = "" # Default empty string
1889
+ elif col == 'color':
1890
+ review_file_df[col] = None # Default None or a default color tuple
1891
+ else: # page, coordinates
1892
+ review_file_df[col] = pd.NA # Default NA for numeric/page
1893
+ available_final_cols.append(col) # Add to list of available columns
1894
+
1895
+ # Select only the final desired columns in the correct order
1896
+ review_file_df = review_file_df[available_final_cols]
1897
+
1898
+ # --- Final Formatting ---
1899
+ if not review_file_df.empty:
1900
+ # Convert list colors to tuples (important for some downstream uses)
1901
+ if 'color' in review_file_df.columns:
1902
+ review_file_df['color'] = review_file_df['color'].apply(
1903
+ lambda x: tuple(x) if isinstance(x, list) else x
1904
+ )
1905
+ # Ensure page column is nullable integer type for reliable grouping
1906
+ if 'page' in review_file_df.columns:
1907
+ review_file_df['page'] = review_file_df['page'].astype('Int64')
1908
+
1909
+ # --- Group Annotations by Page ---
1910
+ if 'page' in review_file_df.columns:
1911
+ grouped_annotations = review_file_df.groupby('page')
1912
+ group_keys = set(grouped_annotations.groups.keys()) # Use set for faster lookups
1913
+ else:
1914
+ # Cannot group if page column is missing
1915
+ print("Error: 'page' column missing, cannot group annotations.")
1916
+ grouped_annotations = None
1917
+ group_keys = set()
1918
+
1919
 
1920
+ # --- Build JSON Structure ---
1921
+ json_data = []
1922
+ output_cols_for_boxes = [col for col in ["label", "color", xmin, ymin, xmax, ymax, "id", "text"] if col in review_file_df.columns]
1923
 
1924
+ # Iterate through page_sizes_df to define the structure (one entry per image path)
1925
+ for _, row in page_sizes_df.iterrows():
1926
+ page_num = row['page'] # Already Int64
1927
+ pdf_image_path = row['image_path']
1928
+ annotation_boxes = [] # Default to empty list
1929
+
1930
+ # Check if the page exists in the grouped annotations (using the faster set lookup)
1931
+ # Check pd.notna because page_num could be <NA> if conversion failed
1932
+ if pd.notna(page_num) and page_num in group_keys and grouped_annotations:
1933
+ try:
1934
+ page_group_df = grouped_annotations.get_group(page_num)
1935
+ # Convert the group to list of dicts, selecting only needed box properties
1936
+ # Handle potential NaN coordinates before conversion to JSON
1937
+ annotation_boxes = page_group_df[output_cols_for_boxes].replace({np.nan: None}).to_dict(orient='records')
1938
+
1939
+ # Optional: Round coordinates here if needed AFTER potential multiplication
1940
+ # for box in annotation_boxes:
1941
+ # for coord in [xmin, ymin, xmax, ymax]:
1942
+ # if coord in box and box[coord] is not None:
1943
+ # box[coord] = round(float(box[coord]), 2) # Example: round to 2 decimals
1944
+
1945
+ except KeyError:
1946
+ print(f"Warning: Group key {page_num} not found despite being in group_keys (should not happen).")
1947
+ annotation_boxes = [] # Keep empty
1948
+
1949
+ # Append the structured data for this image/page
1950
+ json_data.append({
1951
+ "image": pdf_image_path,
1952
+ "boxes": annotation_boxes
1953
+ })
1954
+
1955
+ return json_data
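
Aside: the list returned by `convert_review_df_to_annotation_json` contains one entry per page defined in `page_sizes`, pairing each page image with its (possibly empty) list of boxes. With purely illustrative paths and values, a two-page output looks roughly like this:

```python
[
    {
        "image": "input/example_document_0.png",
        "boxes": [
            {"label": "PERSON", "color": (0, 0, 0),
             "xmin": 120.5, "ymin": 340.0, "xmax": 260.0, "ymax": 362.5,
             "id": "a1b2c3d4e5f6", "text": "Jane Doe"},
        ],
    },
    {"image": "input/example_document_1.png", "boxes": []},  # page with no redactions
]
```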
tools/file_redaction.py CHANGED
@@ -258,8 +258,7 @@ def choose_and_run_redactor(file_paths:List[str],
258
 
259
 
260
  # Call prepare_image_or_pdf only if needed
261
- if prepare_images_flag is not None:# and first_loop_state==True:
262
- #print("Calling preparation function. prepare_images_flag:", prepare_images_flag)
263
  out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df = prepare_image_or_pdf(
264
  file_paths_loop, text_extraction_method, 0, out_message, True,
265
  annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
@@ -333,7 +332,7 @@ def choose_and_run_redactor(file_paths:List[str],
333
  # Try to connect to AWS services directly only if the RUN_AWS_FUNCTIONS environment variable is 1; otherwise an environment variable or direct textbox input is needed.
334
  if pii_identification_method == aws_pii_detector:
335
  if aws_access_key_textbox and aws_secret_key_textbox:
336
- print("Connecting to Comprehend using AWS access key and secret keys from textboxes.")
337
  comprehend_client = boto3.client('comprehend',
338
  aws_access_key_id=aws_access_key_textbox,
339
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
@@ -356,7 +355,7 @@ def choose_and_run_redactor(file_paths:List[str],
356
  # Try to connect to AWS Textract Client if using that text extraction method
357
  if text_extraction_method == textract_option:
358
  if aws_access_key_textbox and aws_secret_key_textbox:
359
- print("Connecting to Textract using AWS access key and secret keys from textboxes.")
360
  textract_client = boto3.client('textract',
361
  aws_access_key_id=aws_access_key_textbox,
362
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
@@ -401,7 +400,7 @@ def choose_and_run_redactor(file_paths:List[str],
401
  is_a_pdf = is_pdf(file_path) == True
402
  if is_a_pdf == False and text_extraction_method == text_ocr_option:
403
  # If user has not submitted a pdf, assume it's an image
404
- print("File is not a pdf, assuming that image analysis needs to be used.")
405
  text_extraction_method = tesseract_ocr_option
406
  else:
407
  out_message = "No file selected"
@@ -862,17 +861,6 @@ def convert_pikepdf_annotations_to_result_annotation_box(page:Page, annot:dict,
862
 
863
  rect = Rect(pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2)
864
 
865
- # if image or image_dimensions:
866
- # print("Dividing result by image coordinates")
867
-
868
- # image_x1, image_y1, image_x2, image_y2 = convert_pymupdf_to_image_coords(page, pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2, image, image_dimensions=image_dimensions)
869
-
870
- # img_annotation_box["xmin"] = image_x1
871
- # img_annotation_box["ymin"] = image_y1
872
- # img_annotation_box["xmax"] = image_x2
873
- # img_annotation_box["ymax"] = image_y2
874
-
875
- # else:
876
  convert_df = pd.DataFrame({
877
  "page": [page_no],
878
  "xmin": [pymupdf_x1],
@@ -1016,9 +1004,6 @@ def redact_page_with_pymupdf(page:Page, page_annotations:dict, image:Image=None,
1016
 
1017
  img_annotation_box = fill_missing_box_ids(img_annotation_box)
1018
 
1019
- #print("image_dimensions:", image_dimensions)
1020
- #print("annot:", annot)
1021
-
1022
  all_image_annotation_boxes.append(img_annotation_box)
1023
 
1024
  # Redact the annotations from the document
@@ -1285,8 +1270,6 @@ def redact_image_pdf(file_path:str,
1285
  page_handwriting_recogniser_results = []
1286
  page_break_return = False
1287
  reported_page_number = str(page_no + 1)
1288
-
1289
- #print("page_sizes_df for row:", page_sizes_df.loc[page_sizes_df["page"] == (page_no + 1)])
1290
 
1291
  # Try to find image location
1292
  try:
@@ -1328,7 +1311,7 @@ def redact_image_pdf(file_path:str,
1328
 
1329
  # Step 1: Perform OCR. Either with Tesseract, or with AWS Textract
1330
 
1331
- # If using Tesseract, need to check if we have page as image_path
1332
  if text_extraction_method == tesseract_ocr_option:
1333
  #print("image_path:", image_path)
1334
  #print("print(type(image_path)):", print(type(image_path)))
@@ -1449,7 +1432,6 @@ def redact_image_pdf(file_path:str,
1449
  # Assume image_path is an image
1450
  image = image_path
1451
 
1452
- print("image:", image)
1453
 
1454
  fill = (0, 0, 0) # Fill colour for redactions
1455
  draw = ImageDraw.Draw(image)
@@ -1631,8 +1613,6 @@ def get_text_container_characters(text_container:LTTextContainer):
1631
  for line in text_container
1632
  if isinstance(line, LTTextLine) or isinstance(line, LTTextLineHorizontal)
1633
  for char in line]
1634
-
1635
- #print("Initial characters:", characters)
1636
 
1637
  return characters
1638
  return []
@@ -1762,9 +1742,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
1762
  analysed_bounding_boxes_df_new = pd.concat([analysed_bounding_boxes_df_new, analysed_bounding_boxes_df_text], axis = 1)
1763
  analysed_bounding_boxes_df_new['page'] = page_num + 1
1764
 
1765
- #analysed_bounding_boxes_df_new = fill_missing_ids(analysed_bounding_boxes_df_new)
1766
- analysed_bounding_boxes_df_new.to_csv("output/analysed_bounding_boxes_df_new_with_ids.csv")
1767
-
1768
  decision_process_table = pd.concat([decision_process_table, analysed_bounding_boxes_df_new], axis = 0).drop('result', axis=1)
1769
 
1770
  return decision_process_table
@@ -1772,7 +1749,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
1772
  def create_pikepdf_annotations_for_bounding_boxes(analysed_bounding_boxes):
1773
  pikepdf_redaction_annotations_on_page = []
1774
  for analysed_bounding_box in analysed_bounding_boxes:
1775
- #print("analysed_bounding_box:", analysed_bounding_boxes)
1776
 
1777
  bounding_box = analysed_bounding_box["boundingBox"]
1778
  annotation = Dictionary(
@@ -1997,7 +1973,6 @@ def redact_text_pdf(
1997
  pass
1998
  #print("Not redacting page:", page_no)
1999
 
2000
- #print("page_image_annotations after page", reported_page_number, "are", page_image_annotations)
2001
 
2002
  # Join extracted text outputs for all lines together
2003
  if not page_text_ocr_outputs.empty:
 
258
 
259
 
260
  # Call prepare_image_or_pdf only if needed
261
+ if prepare_images_flag is not None:
 
262
  out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df = prepare_image_or_pdf(
263
  file_paths_loop, text_extraction_method, 0, out_message, True,
264
  annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
 
332
  # Try to connect to AWS services directly only if the RUN_AWS_FUNCTIONS environment variable is 1; otherwise an environment variable or direct textbox input is needed.
333
  if pii_identification_method == aws_pii_detector:
334
  if aws_access_key_textbox and aws_secret_key_textbox:
335
+ print("Connecting to Comprehend using AWS access key and secret keys from user input.")
336
  comprehend_client = boto3.client('comprehend',
337
  aws_access_key_id=aws_access_key_textbox,
338
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
 
355
  # Try to connect to AWS Textract Client if using that text extraction method
356
  if text_extraction_method == textract_option:
357
  if aws_access_key_textbox and aws_secret_key_textbox:
358
+ print("Connecting to Textract using AWS access key and secret keys from user input.")
359
  textract_client = boto3.client('textract',
360
  aws_access_key_id=aws_access_key_textbox,
361
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
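
Aside: both connection hunks keep the same construction, passing user-supplied keys straight to `boto3.client`; only the log wording changes. A minimal sketch of the pattern (the region default below is a placeholder; the app takes it from `AWS_REGION`):

```python
import boto3

def make_client(service: str, access_key: str = "", secret_key: str = "",
                region: str = "eu-west-2"):  # placeholder region
    """Build a boto3 client from user-supplied keys, or fall back to the default chain."""
    if access_key and secret_key:
        return boto3.client(service,
                            aws_access_key_id=access_key,
                            aws_secret_access_key=secret_key,
                            region_name=region)
    # No keys supplied: let boto3 resolve credentials from the environment or role
    return boto3.client(service, region_name=region)

# e.g. textract_client = make_client('textract', access_key, secret_key)
```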
 
400
  is_a_pdf = is_pdf(file_path) == True
401
  if is_a_pdf == False and text_extraction_method == text_ocr_option:
402
  # If user has not submitted a pdf, assume it's an image
403
+ print("File is not a PDF, assuming that image analysis needs to be used.")
404
  text_extraction_method = tesseract_ocr_option
405
  else:
406
  out_message = "No file selected"
 
861
 
862
  rect = Rect(pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2)
863
 
 
 
 
 
 
 
 
 
 
 
 
864
  convert_df = pd.DataFrame({
865
  "page": [page_no],
866
  "xmin": [pymupdf_x1],
 
1004
 
1005
  img_annotation_box = fill_missing_box_ids(img_annotation_box)
1006
 
 
 
 
1007
  all_image_annotation_boxes.append(img_annotation_box)
1008
 
1009
  # Redact the annotations from the document
 
1270
  page_handwriting_recogniser_results = []
1271
  page_break_return = False
1272
  reported_page_number = str(page_no + 1)
 
 
1273
 
1274
  # Try to find image location
1275
  try:
 
1311
 
1312
  # Step 1: Perform OCR. Either with Tesseract, or with AWS Textract
1313
 
1314
+ # If using Tesseract
1315
  if text_extraction_method == tesseract_ocr_option:
1316
  #print("image_path:", image_path)
1317
  #print("print(type(image_path)):", print(type(image_path)))
 
1432
  # Assume image_path is an image
1433
  image = image_path
1434
 
 
1435
 
1436
  fill = (0, 0, 0) # Fill colour for redactions
1437
  draw = ImageDraw.Draw(image)
 
1613
  for line in text_container
1614
  if isinstance(line, LTTextLine) or isinstance(line, LTTextLineHorizontal)
1615
  for char in line]
 
 
1616
 
1617
  return characters
1618
  return []
 
1742
  analysed_bounding_boxes_df_new = pd.concat([analysed_bounding_boxes_df_new, analysed_bounding_boxes_df_text], axis = 1)
1743
  analysed_bounding_boxes_df_new['page'] = page_num + 1
1744
 
 
 
 
1745
  decision_process_table = pd.concat([decision_process_table, analysed_bounding_boxes_df_new], axis = 0).drop('result', axis=1)
1746
 
1747
  return decision_process_table
 
1749
  def create_pikepdf_annotations_for_bounding_boxes(analysed_bounding_boxes):
1750
  pikepdf_redaction_annotations_on_page = []
1751
  for analysed_bounding_box in analysed_bounding_boxes:
 
1752
 
1753
  bounding_box = analysed_bounding_box["boundingBox"]
1754
  annotation = Dictionary(
 
1973
  pass
1974
  #print("Not redacting page:", page_no)
1975
 
 
1976
 
1977
  # Join extracted text outputs for all lines together
1978
  if not page_text_ocr_outputs.empty:
tools/redaction_review.py CHANGED
@@ -6,12 +6,11 @@ import numpy as np
6
  from xml.etree.ElementTree import Element, SubElement, tostring, parse
7
  from xml.dom import minidom
8
  import uuid
9
- from typing import List
10
  from gradio_image_annotation import image_annotator
11
  from gradio_image_annotation.image_annotator import AnnotatedImageData
12
  from pymupdf import Document, Rect
13
  import pymupdf
14
- #from fitz
15
  from PIL import ImageDraw, Image
16
 
17
  from tools.config import OUTPUT_FOLDER, CUSTOM_BOX_COLOUR, MAX_IMAGE_PIXELS, INPUT_FOLDER
@@ -55,7 +54,6 @@ def update_zoom(current_zoom_level:int, annotate_current_page:int, decrease:bool
55
 
56
  return current_zoom_level, annotate_current_page
57
 
58
-
59
  def update_dropdown_list_based_on_dataframe(df:pd.DataFrame, column:str) -> List["str"]:
60
  '''
61
  Gather unique elements from a string pandas Series, then append 'ALL' to the start and return the list.
@@ -166,49 +164,205 @@ def update_recogniser_dataframes(page_image_annotator_object:AnnotatedImageData,
166
 
167
  return recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_out, recogniser_entities_drop, text_entities_drop, page_entities_drop
168
 
169
- def undo_last_removal(backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base):
170
  return backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
171
 
172
- def update_annotator_page_from_review_df(review_df: pd.DataFrame,
173
- image_file_paths:List[str],
174
- page_sizes:List[dict],
175
- current_page:int,
176
- previous_page:int,
177
- current_image_annotations_state:List[str],
178
- current_page_annotator:object):
 
 
 
179
  '''
180
- Update the visible annotation object with the latest review file information
 
181
  '''
182
- out_image_annotations_state = current_image_annotations_state
183
- out_current_page_annotator = current_page_annotator
184
- gradio_annotator_current_page_number = current_page
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
185
 
 
186
  if not review_df.empty:
187
- #print("review_df just before convert_review_df:", review_df)
188
- # First, check that the image on the current page is valid, replace with what exists in page_sizes object if not
189
- if not gradio_annotator_current_page_number: gradio_annotator_current_page_number = 0
 
190
 
191
- # Check bounding values for current page and page max
192
- if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
193
- elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
194
- else:
195
- gradio_annotator_current_page_number = 0
196
- page_num_reported = 1
197
 
198
- # Ensure page displayed can't exceed number of pages in document
199
- page_max_reported = len(out_image_annotations_state)
200
- if page_num_reported > page_max_reported: page_num_reported = page_max_reported
201
 
202
- page_num_reported_zero_indexed = page_num_reported - 1
203
- out_image_annotations_state = convert_review_df_to_annotation_json(review_df, image_file_paths, page_sizes)
 
 
 
 
204
 
205
- page_image_annotator_object, out_image_annotations_state = replace_images_in_image_annotation_object(out_image_annotations_state, out_image_annotations_state[page_num_reported_zero_indexed], page_sizes, page_num_reported)
 
206
 
207
- out_image_annotations_state[page_num_reported_zero_indexed] = page_image_annotator_object
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
208
 
209
- out_current_page_annotator = out_image_annotations_state[page_num_reported_zero_indexed]
 
 
210
 
211
- return out_current_page_annotator, out_image_annotations_state
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
212
 
213
  def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
214
  selected_rows_df: pd.DataFrame,
@@ -216,7 +370,7 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
216
  page_sizes:List[dict],
217
  image_annotations_state:dict,
218
  recogniser_entity_dataframe_base:pd.DataFrame):
219
- '''
220
  Remove selected items from the review dataframe from the annotation object and review dataframe.
221
  '''
222
 
@@ -253,149 +407,267 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
253
 
254
  return out_review_df, out_image_annotations_state, out_recogniser_entity_dataframe_base, backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
255
 
256
- def update_annotator_object_and_filter_df(
257
- all_image_annotations:List[AnnotatedImageData],
258
- gradio_annotator_current_page_number:int,
259
- recogniser_entities_dropdown_value:str="ALL",
260
- page_dropdown_value:str="ALL",
261
- text_dropdown_value:str="ALL",
262
- recogniser_dataframe_base:gr.Dataframe=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}), type="pandas", headers=["page", "label", "text", "id"], show_fullscreen_button=True, wrap=True, show_search='filter', max_height=400, static_columns=[0,1,2,3]),
263
- zoom:int=100,
264
- review_df:pd.DataFrame=[],
265
- page_sizes:List[dict]=[],
266
- doc_full_file_name_textbox:str='',
267
- input_folder:str=INPUT_FOLDER):
268
- '''
269
- Update a gradio_image_annotation object with new annotation data.
270
- '''
271
- zoom_str = str(zoom) + '%'
272
-
273
- #print("all_image_annotations at start of update_annotator_object_and_filter_df[-1]:", all_image_annotations[-1])
274
-
275
- if not gradio_annotator_current_page_number: gradio_annotator_current_page_number = 0
276
-
277
- # Check bounding values for current page and page max
278
- if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
279
- elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
280
- else:
281
- gradio_annotator_current_page_number = 0
282
- page_num_reported = 1
283
 
284
- # Ensure page displayed can't exceed number of pages in document
285
- page_max_reported = len(all_image_annotations)
286
- if page_num_reported > page_max_reported: page_num_reported = page_max_reported
287
 
288
- page_num_reported_zero_indexed = page_num_reported - 1
 
 
 
 
289
 
290
- # First, check that the image on the current page is valid, replace with what exists in page_sizes object if not
291
- page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, all_image_annotations[page_num_reported_zero_indexed], page_sizes, page_num_reported)
292
 
293
- all_image_annotations[page_num_reported_zero_indexed] = page_image_annotator_object
294
-
295
- current_image_path = all_image_annotations[page_num_reported_zero_indexed]['image']
 
 
 
296
 
297
- # If image path is still not valid, load in a new image an overwrite it. Then replace all items in the image annotation object for all pages based on the updated information.
298
- page_sizes_df = pd.DataFrame(page_sizes)
299
 
300
- if not os.path.exists(current_image_path):
 
 
301
 
302
- page_num, replaced_image_path, width, height = process_single_page_for_image_conversion(doc_full_file_name_textbox, page_num_reported_zero_indexed, input_folder=input_folder)
303
 
304
- # Overwrite page_sizes values
305
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
306
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
307
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
308
-
309
- else:
310
- if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
311
- width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
312
- height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
313
- else:
314
- image = Image.open(current_image_path)
315
- width = image.width
316
- height = image.height
317
 
 
318
  page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
319
  page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
 
 
 
 
 
 
 
 
 
 
320
 
321
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = current_image_path
 
322
 
323
- replaced_image_path = current_image_path
324
 
325
- if review_df.empty: review_df = pd.DataFrame(columns=["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"])
326
- review_df.loc[review_df["page"]==page_num_reported, 'image'] = replaced_image_path
 
327
 
328
- # Update dropdowns and review selection dataframe with the updated annotator object
329
- recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_modified, recogniser_entities_dropdown_value, text_entities_drop, page_entities_drop = update_recogniser_dataframes(all_image_annotations, recogniser_dataframe_base, recogniser_entities_dropdown_value, text_dropdown_value, page_dropdown_value, review_df.copy(), page_sizes)
330
-
331
- recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
332
 
333
- # page_sizes_df has been changed - save back to page_sizes_object
334
- page_sizes = page_sizes_df.to_dict(orient='records')
 
 
 
 
 
 
 
335
 
336
- images_list = list(page_sizes_df["image_path"])
337
- images_list[page_num_reported_zero_indexed] = replaced_image_path
338
 
339
- all_image_annotations[page_num_reported_zero_indexed]['image'] = replaced_image_path
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
340
 
341
- # Multiply out image_annotation coordinates from relative to absolute if necessary
342
- all_image_annotations_df = convert_annotation_data_to_dataframe(all_image_annotations)
 
 
343
 
344
- all_image_annotations_df = multiply_coordinates_by_page_sizes(all_image_annotations_df, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
 
 
 
 
345
 
346
- #print("all_image_annotations_df[-1] just before creating annotation dicts:", all_image_annotations_df.iloc[-1, :])
347
 
348
- all_image_annotations = create_annotation_dicts_from_annotation_df(all_image_annotations_df, page_sizes)
 
 
 
349
 
350
- #print("all_image_annotations[-1] after creating annotation dicts:", all_image_annotations[-1])
351
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
352
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
353
 
354
- # Remove blank duplicate entries
355
- all_image_annotations = remove_duplicate_images_with_blank_boxes(all_image_annotations)
 
 
 
 
 
 
356
 
357
- current_page_image_annotator_object = all_image_annotations[page_num_reported_zero_indexed]
358
 
359
- #print("current_page_image_annotator_object that goes into annotator object:", current_page_image_annotator_object)
 
360
 
361
- page_number_reported_gradio = gr.Number(label = "Current page", value=page_num_reported, precision=0)
362
 
363
- ###
364
- # If no data, present a blank page
365
- if not all_image_annotations:
366
- print("No all_image_annotation object found")
367
- page_num_reported = 1
368
 
369
- out_image_annotator = image_annotator(
370
- value = None,
371
- boxes_alpha=0.1,
372
- box_thickness=1,
373
- label_list=recogniser_entities_list,
374
- label_colors=recogniser_colour_list,
375
- show_label=False,
376
- height=zoom_str,
377
- width=zoom_str,
378
- box_min_size=1,
379
- box_selected_thickness=2,
380
- handle_size=4,
381
- sources=None,#["upload"],
382
- show_clear_button=False,
383
- show_share_button=False,
384
- show_remove_button=False,
385
- handles_cursor=True,
386
- interactive=True,
387
- use_default_label=True
388
- )
389
-
390
- return out_image_annotator, page_number_reported_gradio, page_number_reported_gradio, page_num_reported, recogniser_entities_dropdown_value, recogniser_dataframe_out_gr, recogniser_dataframe_modified, text_entities_drop, page_entities_drop, page_sizes, all_image_annotations
391
-
392
  else:
393
- ### Present image_annotator outputs
394
  out_image_annotator = image_annotator(
395
  value = current_page_image_annotator_object,
396
  boxes_alpha=0.1,
397
  box_thickness=1,
398
- label_list=recogniser_entities_list,
399
  label_colors=recogniser_colour_list,
400
  show_label=False,
401
  height=zoom_str,
@@ -408,41 +680,23 @@ def update_annotator_object_and_filter_df(
408
  show_share_button=False,
409
  show_remove_button=False,
410
  handles_cursor=True,
411
- interactive=True
412
  )
413
 
414
- #print("all_image_annotations at end of update_annotator...:", all_image_annotations)
415
- #print("review_df at end of update_annotator_object:", review_df)
416
-
417
- return out_image_annotator, page_number_reported_gradio, page_number_reported_gradio, page_num_reported, recogniser_entities_dropdown_value, recogniser_dataframe_out_gr, recogniser_dataframe_modified, text_entities_drop, page_entities_drop, page_sizes, all_image_annotations
418
-
419
- def replace_images_in_image_annotation_object(
420
- all_image_annotations:List[dict],
421
- page_image_annotator_object:AnnotatedImageData,
422
- page_sizes:List[dict],
423
- page:int):
424
-
425
- '''
426
- Check if the image value in an AnnotatedImageData dict is a placeholder or np.array. If either of these, replace the value with the file path of the image that is hopefully already loaded into the app related to this page.
427
- '''
428
-
429
- page_zero_index = page - 1
430
-
431
- if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
432
- page_sizes_df = pd.DataFrame(page_sizes)
433
- page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
434
-
435
- # Check for matching pages
436
- matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()
437
-
438
- if matching_paths.size > 0:
439
- image_path = matching_paths[0]
440
- page_image_annotator_object['image'] = image_path
441
- all_image_annotations[page_zero_index]["image"] = image_path
442
- else:
443
- print(f"No image path found for page {page}.")
444
-
445
- return page_image_annotator_object, all_image_annotations
446
 
447
  def update_all_page_annotation_object_based_on_previous_page(
448
  page_image_annotator_object:AnnotatedImageData,
@@ -459,12 +713,9 @@ def update_all_page_annotation_object_based_on_previous_page(
459
  previous_page_zero_index = previous_page -1
460
 
461
  if not current_page: current_page = 1
462
-
463
- #print("page_image_annotator_object at start of update_all_page_annotation_object:", page_image_annotator_object)
464
-
465
- page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)
466
-
467
- #print("page_image_annotator_object after replace_images in update_all_page_annotation_object:", page_image_annotator_object)
468
 
469
  if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
470
  else: all_image_annotations[previous_page_zero_index]["boxes"] = []
@@ -493,7 +744,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
493
  page_image_annotator_object = all_image_annotations[current_page - 1]
494
 
495
  # This replaces the numpy array image object with the image file path
496
- page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, page_image_annotator_object, page_sizes, current_page)
497
  page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]
498
 
499
  if not page_image_annotator_object:
@@ -529,7 +780,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
529
  # Check if all elements are integers in the range 0-255
530
  if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
531
  pass
532
- #print("fill:", fill)
533
  else:
534
  print(f"Invalid color values: {fill}. Defaulting to black.")
535
  fill = (0, 0, 0) # Default to black if invalid
@@ -553,7 +804,6 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
553
  doc = [image]
554
 
555
  elif file_extension in '.csv':
556
- #print("This is a csv")
557
  pdf_doc = []
558
 
559
  # If working with pdfs
@@ -797,11 +1047,9 @@ def df_select_callback(df: pd.DataFrame, evt: gr.SelectData):
797
 
798
  row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})
799
 
800
- return row_value_page, row_value_df
801
 
802
  def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):
803
-
804
- #print("evt.data:", evt._data)
805
 
806
  row_value_job_id = evt.row_value[0] # This is the page number value
807
  # row_value_label = evt.row_value[1] # This is the label number value
@@ -829,59 +1077,108 @@ def df_select_callback_ocr(df: pd.DataFrame, evt: gr.SelectData):
829
 
830
  return row_value_page, row_value_df
831
 
832
- def update_selected_review_df_row_colour(redaction_row_selection:pd.DataFrame, review_df:pd.DataFrame, previous_id:str="", previous_colour:str='(0, 0, 0)', page_sizes:List[dict]=[], output_folder:str=OUTPUT_FOLDER, colour:str='(1, 0, 255)'):
 
 
 
 
 
 
833
  '''
834
  Update the colour of a single redaction box based on the values in a selection row
 
835
  '''
836
- colour_tuple = str(tuple(colour))
837
 
838
- if "color" not in review_df.columns: review_df["color"] = '(0, 0, 0)'
 
 
 
 
839
  if "id" not in review_df.columns:
840
- review_df = fill_missing_ids(review_df)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
841
 
842
- # Reset existing highlight colours
843
- review_df.loc[review_df["id"]==previous_id, "color"] = review_df.loc[review_df["id"]==previous_id, "color"].apply(lambda _: previous_colour)
844
- review_df.loc[review_df["color"].astype(str)==colour, "color"] = review_df.loc[review_df["color"].astype(str)==colour, "color"].apply(lambda _: '(0, 0, 0)')
845
 
846
  if not redaction_row_selection.empty and not review_df.empty:
847
  use_id = (
848
- "id" in redaction_row_selection.columns
849
- and "id" in review_df.columns
850
- and not redaction_row_selection["id"].isnull().all()
851
  and not review_df["id"].isnull().all()
852
  )
853
 
854
- selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]
855
 
856
- review_df = review_df.merge(redaction_row_selection[selected_merge_cols], on=selected_merge_cols, indicator=True, how="left")
 
 
 
 
 
 
857
 
858
- if "_merge" in review_df.columns:
859
- filtered_reviews = review_df.loc[review_df["_merge"]=="both"]
860
- else:
861
- filtered_reviews = pd.DataFrame()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
862
 
863
- if not filtered_reviews.empty:
864
- previous_colour = str(filtered_reviews["color"].values[0])
865
- previous_id = filtered_reviews["id"].values[0]
866
- review_df.loc[review_df["_merge"]=="both", "color"] = review_df.loc[review_df["_merge"] == "both", "color"].apply(lambda _: colour)
867
  else:
868
- # Handle the case where no rows match the condition
869
- print("No reviews found with _merge == 'both'")
870
- previous_colour = '(0, 0, 0)'
871
- review_df.loc[review_df["color"]==colour, "color"] = previous_colour
872
- previous_id =''
873
 
874
- review_df.drop("_merge", axis=1, inplace=True)
875
 
876
- # Ensure that all output coordinates are in proportional size
877
- #page_sizes_df = pd.DataFrame(page_sizes)
878
- #page_sizes_df .loc[:, "page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
879
- #print("review_df before divide:", review_df)
880
- #print("page_sizes_df before divide:", page_sizes_df)
881
- #review_df = divide_coordinates_by_page_sizes(review_df, page_sizes_df)
882
- #print("review_df after divide:", review_df)
883
 
884
- review_df = review_df[["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]]
885
 
886
  return review_df, previous_id, previous_colour
887
 
@@ -988,8 +1285,6 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
988
  page_sizes_df = pd.DataFrame(page_sizes)
989
 
990
  # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export
991
- #print("Using pymupdf coordinates for conversion.")
992
-
993
  pages_are_images = False
994
 
995
  if "mediabox_width" not in review_file_df.columns:
@@ -1041,33 +1336,9 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
1041
  raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
1042
  else:
1043
  print("Document cropboxes not found.")
1044
-
1045
 
1046
  pdf_page_height = pymupdf_page.mediabox.height
1047
- pdf_page_width = pymupdf_page.mediabox.width
1048
-
1049
- # Check if image dimensions for page exist in page_sizes_df
1050
- # image_dimensions = {}
1051
-
1052
- # image_dimensions['image_width'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
1053
- # image_dimensions['image_height'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
1054
-
1055
- # if pd.isna(image_dimensions['image_width']):
1056
- # image_dimensions = {}
1057
-
1058
- # image = image_paths[page_python_format]
1059
-
1060
- # if image_dimensions:
1061
- # image_page_width, image_page_height = image_dimensions["image_width"], image_dimensions["image_height"]
1062
- # if isinstance(image, str) and 'placeholder' not in image:
1063
- # image = Image.open(image)
1064
- # image_page_width, image_page_height = image.size
1065
- # else:
1066
- # try:
1067
- # image = Image.open(image)
1068
- # image_page_width, image_page_height = image.size
1069
- # except Exception as e:
1070
- # print("Could not get image sizes due to:", e)
1071
 
1072
  # Create redaction annotation
1073
  redact_annot = SubElement(annots, 'redact')
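
Aside: the `create_xfdf` hunks trim commented-out image-size probing but keep the ElementTree construction of `<redact>` elements. A rough, self-contained sketch of the XFDF shape being built; the attribute names are assumptions based on Adobe's XFDF redaction format, since the full element setup sits outside this hunk:

```python
from xml.etree.ElementTree import Element, SubElement, tostring
from xml.dom import minidom

xfdf = Element('xfdf', attrib={'xmlns': 'http://ns.adobe.com/xfdf/'})
annots = SubElement(xfdf, 'annots')

# One <redact> element per review row; 'rect' is "x1,y1,x2,y2" in PDF points
redact = SubElement(annots, 'redact')
redact.set('page', '0')                        # zero-based page index (assumed)
redact.set('rect', '100.0,700.0,250.0,720.0')  # illustrative coordinates
redact.set('title', 'PERSON')

print(minidom.parseString(tostring(xfdf)).toprettyxml(indent="  "))
```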
@@ -1345,8 +1616,6 @@ def convert_xfdf_to_dataframe(file_paths_list:List[str], pymupdf_doc, image_path
1345
  # Optionally, you can add the image path or other relevant information
1346
  df.loc[_, 'image'] = image_path
1347
 
1348
- #print('row:', row)
1349
-
1350
  out_file_path = output_folder + file_path_name + "_review_file.csv"
1351
  df.to_csv(out_file_path, index=None)
1352
 
 
6
  from xml.etree.ElementTree import Element, SubElement, tostring, parse
7
  from xml.dom import minidom
8
  import uuid
9
+ from typing import List, Tuple
10
  from gradio_image_annotation import image_annotator
11
  from gradio_image_annotation.image_annotator import AnnotatedImageData
12
  from pymupdf import Document, Rect
13
  import pymupdf
 
14
  from PIL import ImageDraw, Image
15
 
16
  from tools.config import OUTPUT_FOLDER, CUSTOM_BOX_COLOUR, MAX_IMAGE_PIXELS, INPUT_FOLDER
 
54
 
55
  return current_zoom_level, annotate_current_page
56
 
 
57
  def update_dropdown_list_based_on_dataframe(df:pd.DataFrame, column:str) -> List["str"]:
58
  '''
59
  Gather unique elements from a string pandas Series, then append 'ALL' to the start and return the list.
 
164
 
165
  return recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_out, recogniser_entities_drop, text_entities_drop, page_entities_drop
166
 
167
+ def undo_last_removal(backup_review_state:pd.DataFrame, backup_image_annotations_state:list[dict], backup_recogniser_entity_dataframe_base:pd.DataFrame):
168
  return backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
169
 
170
+ def update_annotator_page_from_review_df(
171
+ review_df: pd.DataFrame,
172
+ image_file_paths:List[str], # Note: this input no longer appears to be used, since the first line of the original logic was removed
173
+ page_sizes:List[dict],
174
+ current_image_annotations_state:List[str], # This should ideally be List[dict] based on its usage
175
+ current_page_annotator:object, # Should be dict or a custom annotation object for one page
176
+ selected_recogniser_entity_df_row:pd.DataFrame,
177
+ input_folder:str,
178
+ doc_full_file_name_textbox:str
179
+ ) -> Tuple[object, List[dict], int, List[dict], pd.DataFrame, int]: # Return types inferred from how the outputs are used
180
  '''
181
+ Update the visible annotation object and related objects with the latest review file information,
182
+ optimizing by processing only the current page's data.
183
  '''
184
+ # Assume current_image_annotations_state is List[dict] and current_page_annotator is dict
185
+ out_image_annotations_state: List[dict] = list(current_image_annotations_state) # Make a copy to avoid modifying input in place
186
+ out_current_page_annotator: dict = current_page_annotator
187
+
188
+ # Get the target page number from the selected row
189
+ # Safely access the page number, handling potential errors or empty DataFrame
190
+ gradio_annotator_current_page_number: int = 0
191
+ annotate_previous_page: int = 0 # Stores the originally reported page number, matching the original output
192
+ if not selected_recogniser_entity_df_row.empty and 'page' in selected_recogniser_entity_df_row.columns:
193
+ try:
194
+ # Use .iloc[0] and .item() for robust scalar extraction
195
+ gradio_annotator_current_page_number = int(selected_recogniser_entity_df_row['page'].iloc[0])
196
+ annotate_previous_page = gradio_annotator_current_page_number # Store original page number
197
+ except (IndexError, ValueError, TypeError):
198
+ print("Warning: Could not extract valid page number from selected_recogniser_entity_df_row. Defaulting to page 0 (or 1).")
199
+ gradio_annotator_current_page_number = 1 # Or 0 depending on 1-based vs 0-based indexing elsewhere
200
+
201
+ # Ensure page number is valid and 1-based for external display/logic
202
+ if gradio_annotator_current_page_number <= 0:
203
+ gradio_annotator_current_page_number = 1
204
+
205
+ page_max_reported = len(out_image_annotations_state)
206
+ if gradio_annotator_current_page_number > page_max_reported:
207
+ gradio_annotator_current_page_number = page_max_reported # Cap at max pages
208
+
209
+ page_num_reported_zero_indexed = gradio_annotator_current_page_number - 1
210
+
211
+ # Process page sizes DataFrame early, as it's needed for image path handling and potentially coordinate multiplication
212
+ page_sizes_df = pd.DataFrame(page_sizes)
213
+ if not page_sizes_df.empty:
214
+ # Safely convert page column to numeric and then int
215
+ page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
216
+ page_sizes_df.dropna(subset=["page"], inplace=True)
217
+ if not page_sizes_df.empty:
218
+ page_sizes_df["page"] = page_sizes_df["page"].astype(int)
219
+ else:
220
+ print("Warning: Page sizes DataFrame became empty after processing.")
221
 
222
+ # --- OPTIMIZATION: Process only the current page's data from review_df ---
  if not review_df.empty:
+     # Filter review_df for the current page. Ensure the 'page' column in
+     # review_df is comparable to the reported page number.
+     if 'page' in review_df.columns:
+         review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)
+
+         current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']
+
+         replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(doc_full_file_name_textbox, current_image_path, page_sizes_df, gradio_annotator_current_page_number, input_folder)
+
+         # page_sizes_df has been changed - save it back to the page_sizes object
+         page_sizes = page_sizes_df.to_dict(orient='records')
+         review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path
+         images_list = list(page_sizes_df["image_path"])
+         images_list[page_num_reported_zero_indexed] = replaced_image_path
+         out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path
+
+         current_page_review_df = review_df[review_df['page'] == gradio_annotator_current_page_number].copy()
+         current_page_review_df = multiply_coordinates_by_page_sizes(current_page_review_df, page_sizes_df)
+
+     else:
+         print(f"Warning: 'page' column not found in review_df. Cannot filter for page {gradio_annotator_current_page_number}. Skipping update from review_df.")
+         current_page_review_df = pd.DataFrame()  # Empty DataFrame if the filter fails
+
+     if not current_page_review_df.empty:
+         # Convert the current page's review data to the annotation list format for *this page*.
+         # Expected annotation dict keys: the coordinates, 'label', 'color', 'text' and 'id';
+         # review_df is assumed to have compatible columns.
+         expected_annotation_keys = ['label', 'color', 'xmin', 'ymin', 'xmax', 'ymax', 'text', 'id']
+
+         # Ensure the necessary columns exist in current_page_review_df before converting rows
+         for key in expected_annotation_keys:
+             if key not in current_page_review_df.columns:
+                 # Add any missing column with a default value:
+                 # np.nan for the numeric coordinates, '' for string/object columns
+                 default_value = np.nan if key in ['xmin', 'ymin', 'xmax', 'ymax'] else ''
+                 current_page_review_df[key] = default_value
+
+         # Convert the filtered DataFrame rows to a list of dicts;
+         # .to_dict(orient='records') is efficient for this
+         current_page_annotations_list = current_page_review_df[expected_annotation_keys].to_dict(orient='records')
+
+         # Update the annotations state for the current page. Each entry in
+         # out_image_annotations_state is a dict with an 'image' path and a 'boxes'
+         # list of annotation dicts, so replace the 'boxes' list for this page.
+         page_state_entry_found = False
+         for i, page_state_entry in enumerate(out_image_annotations_state):
+             # Extract the page number from the image filename (zero-indexed in the filename)
+             match = re.search(r"(\d+)\.png$", page_state_entry['image'])
+             if match: page_no = int(match.group(1))
+             else: page_no = -1
+
+             if 'image' in page_state_entry and page_no == page_num_reported_zero_indexed:
+                 # Replace the annotations list for this page with the new list from review_df
+                 out_image_annotations_state[i]['boxes'] = current_page_annotations_list
+
+                 # Update the image path as well, based on review_df if available
+                 if 'image' in current_page_review_df.columns and not current_page_review_df.empty:
+                     # Use the image path from the first row of the filtered review_df
+                     out_image_annotations_state[i]['image'] = current_page_review_df['image'].iloc[0]
+                 page_state_entry_found = True
+                 break
+
+         if not page_state_entry_found:
+             # This can happen if out_image_annotations_state did not initially contain an
+             # entry for this page number. Based on the structure of the original code,
+             # out_image_annotations_state is expected to be pre-populated for all pages.
+             print(f"Warning: Entry for page {gradio_annotator_current_page_number} not found in current_image_annotations_state. Cannot update page annotations.")
+
+ # --- Image path and page size handling for the current page ---
+ # Get the image path for the current page from the updated state,
+ # ensuring the entry exists before accessing it
+ current_image_path = None
+ if len(out_image_annotations_state) > page_num_reported_zero_indexed and 'image' in out_image_annotations_state[page_num_reported_zero_indexed]:
+     current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']
+ else:
+     print(f"Warning: Could not get image path from state for page index {page_num_reported_zero_indexed}.")
+
+ # Replace the placeholder image with the real image path if needed
+ if current_image_path and not page_sizes_df.empty:
+     try:
+         replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
+             doc_full_file_name_textbox, current_image_path, page_sizes_df,
+             gradio_annotator_current_page_number, input_folder  # Use the 1-based page number
+         )
+
+         # Update the state and review_df with the potentially replaced image path
+         if len(out_image_annotations_state) > page_num_reported_zero_indexed:
+             out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path
+
+         if 'page' in review_df.columns and 'image' in review_df.columns:
+             review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path
+
+     except Exception as e:
+         print(f"Error during image path replacement for page {gradio_annotator_current_page_number}: {e}")
+
+ # Save page_sizes_df back to the page_sizes list format
+ if not page_sizes_df.empty:
+     page_sizes = page_sizes_df.to_dict(orient='records')
+ else:
+     page_sizes = []  # Ensure page_sizes is a list if the DataFrame is empty
+
+ # --- Coordinate multiplication and duplicate removal ---
+ # The original code multiplied coordinates and removed duplicates across the
+ # *entire* document after converting the full review_df to state; with the
+ # optimised approach, only the current page's annotations were updated above.
+ # remove_duplicate_images_with_blank_boxes expects the raw list-of-dicts state format:
+ try:
+     out_image_annotations_state = remove_duplicate_images_with_blank_boxes(out_image_annotations_state)
+ except Exception as e:
+     print(f"Error during duplicate removal: {e}. Proceeding without duplicate removal.")
+
+ # Select the current page's annotation object from the (potentially updated) state
+ if len(out_image_annotations_state) > page_num_reported_zero_indexed:
+     out_current_page_annotator = out_image_annotations_state[page_num_reported_zero_indexed]
+ else:
+     print(f"Warning: Cannot select current page annotator object for index {page_num_reported_zero_indexed}.")
+     out_current_page_annotator = {}
+
+ # The reported page number may have been adjusted by the bounding checks above,
+ # so return the checked value
+ final_page_number_returned = gradio_annotator_current_page_number
+
+ return (out_current_page_annotator,
+         out_image_annotations_state,
+         final_page_number_returned,
+         page_sizes,
+         review_df,  # review_df's 'page' column type may have been changed
+         annotate_previous_page)  # The original page number from selected_recogniser_entity_df_row

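The pattern above — cut `review_df` down to a single page, then scale that page's 0–1 coordinates by the page's rendered image size — is the core of the navigation speed-up. A minimal, self-contained sketch of the idea (this is not the app's actual `multiply_coordinates_by_page_sizes`; the column names follow the review-file format, and the "values look relative" threshold is an assumption):

```python
import pandas as pd

def scale_page_coordinates(review_df: pd.DataFrame, page: int, width: int, height: int) -> pd.DataFrame:
    """Return only the given page's rows, with relative (0-1) coords scaled to pixels."""
    page_df = review_df[review_df["page"] == page].copy()
    # Only scale when the values look relative; absolute pixel coords are left alone
    if not page_df.empty and page_df[["xmin", "xmax", "ymin", "ymax"]].max().max() <= 1:
        page_df[["xmin", "xmax"]] *= width
        page_df[["ymin", "ymax"]] *= height
    return page_df

review = pd.DataFrame({
    "page": [1, 1, 2],
    "xmin": [0.1, 0.4, 0.2], "xmax": [0.3, 0.6, 0.5],
    "ymin": [0.2, 0.1, 0.3], "ymax": [0.25, 0.2, 0.4],
})
print(scale_page_coordinates(review, page=1, width=1000, height=1400))  # two rows, pixel coords
```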
  def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
                                            selected_rows_df: pd.DataFrame,
                                            page_sizes: List[dict],
                                            image_annotations_state: dict,
                                            recogniser_entity_dataframe_base: pd.DataFrame):
+     '''
      Remove the items selected in the review dataframe from both the annotation object and the review dataframe itself.
      '''

      return out_review_df, out_image_annotations_state, out_recogniser_entity_dataframe_base, backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base

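The body of `exclude_selected_items_from_redaction` is collapsed in this diff, but "drop these selected rows" is typically a pandas anti-join. A sketch under the assumption that rows are matched on the `id` column (the real function also rebuilds the annotation state and keeps backup copies for undo):

```python
import pandas as pd

def drop_selected_rows(review_df: pd.DataFrame, selected_rows_df: pd.DataFrame) -> pd.DataFrame:
    """Anti-join: keep only the review rows whose id is not in the selection."""
    return review_df[~review_df["id"].isin(selected_rows_df["id"])].copy()
```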
+ def replace_annotator_object_img_np_array_with_page_sizes_image_path(
+         all_image_annotations: List[dict],
+         page_image_annotator_object: AnnotatedImageData,
+         page_sizes: List[dict],
+         page: int):
+     '''
+     Check whether the image value in an AnnotatedImageData dict is a placeholder or an
+     np.array. If it is either of these, replace the value with the file path of the image
+     that should already be loaded into the app for this page.
+     '''
+
+     page_zero_index = page - 1
+
+     if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
+         page_sizes_df = pd.DataFrame(page_sizes)
+         page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
+
+         # Check for matching pages
+         matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()
+
+         if matching_paths.size > 0:
+             image_path = matching_paths[0]
+             page_image_annotator_object['image'] = image_path
+             all_image_annotations[page_zero_index]["image"] = image_path
+         else:
+             print(f"No image path found for page {page}.")
+
+     return page_image_annotator_object, all_image_annotations

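A usage sketch for the helper above; the `page_sizes` record and annotation entry are invented fixtures, but they show the shape of data it expects — a placeholder (or `np.ndarray`) image value that gets swapped for the `image_path` recorded for that page:

```python
page_sizes = [{"page": 1, "image_path": "input/example_doc_1.png",
               "image_width": 1000, "image_height": 1400}]
all_image_annotations = [{"image": "placeholder_image.png", "boxes": []}]
page_obj = all_image_annotations[0]

page_obj, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(
    all_image_annotations, page_obj, page_sizes, page=1)
assert page_obj["image"] == "input/example_doc_1.png"
```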
+ def replace_placeholder_image_with_real_image(doc_full_file_name_textbox: str, current_image_path: str, page_sizes_df: pd.DataFrame, page_num_reported: int, input_folder: str):
+     '''
+     If the image path is no longer valid, render a new image for the page and overwrite it,
+     then update the page sizes DataFrame with the new image path and dimensions.
+     '''
+     page_num_reported_zero_indexed = page_num_reported - 1
+
+     if not os.path.exists(current_image_path):
+
+         page_num, replaced_image_path, width, height = process_single_page_for_image_conversion(doc_full_file_name_textbox, page_num_reported_zero_indexed, input_folder=input_folder)
+
+         # Overwrite the page_sizes values
          page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
          page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
+         page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
+
+     else:
+         if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
+             width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
+             height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
+         else:
+             image = Image.open(current_image_path)
+             width = image.width
+             height = image.height
+
+             page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
+             page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
+
+         page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = current_image_path
+
+         replaced_image_path = current_image_path
+
+     return replaced_image_path, page_sizes_df

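A sketch of how this helper slots into page navigation: when the recorded image path has gone missing, the page is re-rendered through the app's `process_single_page_for_image_conversion` and the page-size table is patched in place — so this snippet only runs inside the app, and the file names are invented:

```python
import pandas as pd

page_sizes_df = pd.DataFrame([{"page": 1, "image_path": "input/missing_1.png",
                               "image_width": None, "image_height": None}])

new_path, page_sizes_df = replace_placeholder_image_with_real_image(
    "input/example_doc.pdf",   # source document
    "input/missing_1.png",     # stale image path for the page
    page_sizes_df,
    page_num_reported=1,
    input_folder="input/")
print(new_path, page_sizes_df.loc[0, "image_width"])
```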
+ def update_annotator_object_and_filter_df(
+         all_image_annotations: List[AnnotatedImageData],
+         gradio_annotator_current_page_number: int,
+         recogniser_entities_dropdown_value: str = "ALL",
+         page_dropdown_value: str = "ALL",
+         text_dropdown_value: str = "ALL",
+         recogniser_dataframe_base: gr.Dataframe = None,  # Simplified default
+         zoom: int = 100,
+         review_df: pd.DataFrame = None,  # Use None for a default empty DataFrame
+         page_sizes: List[dict] = [],
+         doc_full_file_name_textbox: str = '',
+         input_folder: str = INPUT_FOLDER
+ ) -> Tuple[image_annotator, gr.Number, gr.Number, int, str, gr.Dataframe, pd.DataFrame, List[str], List[str], List[dict], List[AnnotatedImageData]]:
+     '''
+     Update a gradio_image_annotation object with new annotation data for the current page
+     and update the filter dataframes, processing only the current page's data for display.
+     '''
+     zoom_str = str(zoom) + '%'
+
+     # Handle default empty review_df and recogniser_dataframe_base
+     if review_df is None or not isinstance(review_df, pd.DataFrame):
+         review_df = pd.DataFrame(columns=["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"])
+     if recogniser_dataframe_base is None:  # Create a simple default if None
+         recogniser_dataframe_base = gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}))
+
+     # Handle an empty all_image_annotations state early
+     if not all_image_annotations:
+         print("No all_image_annotations object found")
+         # Return blank/default outputs
+         blank_annotator = image_annotator(
+             value=None, boxes_alpha=0.1, box_thickness=1, label_list=[], label_colors=[],
+             show_label=False, height=zoom_str, width=zoom_str, box_min_size=1,
+             box_selected_thickness=2, handle_size=4, sources=None,
+             show_clear_button=False, show_share_button=False, show_remove_button=False,
+             handles_cursor=True, interactive=True, use_default_label=True
+         )
+         blank_df_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
+         blank_df_modified = pd.DataFrame(columns=["page", "label", "text", "id"])
+
+         return (blank_annotator, gr.Number(value=1), gr.Number(value=1), 1,
+                 recogniser_entities_dropdown_value, blank_df_out_gr, blank_df_modified,
+                 [], [], [], [])  # Empty lists/defaults for the remaining outputs
+
+     # Validate and bound the current page number (1-based logic)
+     page_num_reported = max(1, gradio_annotator_current_page_number)  # Minimum page is 1
+     page_max_reported = len(all_image_annotations)
+     if page_num_reported > page_max_reported:
+         page_num_reported = page_max_reported
+
+     page_num_reported_zero_indexed = page_num_reported - 1
+     annotate_previous_page = page_num_reported  # Store the determined page number
+
+     # --- Process the page sizes DataFrame ---
+     page_sizes_df = pd.DataFrame(page_sizes)
+     if not page_sizes_df.empty:
+         page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
+         page_sizes_df.dropna(subset=["page"], inplace=True)
+         if not page_sizes_df.empty:
+             page_sizes_df["page"] = page_sizes_df["page"].astype(int)
+         else:
+             print("Warning: page sizes DataFrame became empty after processing.")
+
+     # --- Handle image path replacement for the current page ---
+     # This modifies the specific page entry within the all_image_annotations list
+     # in place, matching the behaviour of the original code.
+     if len(all_image_annotations) > page_num_reported_zero_indexed:
+         page_object_to_update = all_image_annotations[page_num_reported_zero_indexed]
+
+         # Replace a placeholder/np.array image value with the page's image file path.
+         # The helper returns both the modified page object and the updated full state.
+         updated_page_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(
+             all_image_annotations, page_object_to_update, page_sizes, page_num_reported)
+
+         # Now handle the actual image file path replacement on disk
+         current_image_path = updated_page_object.get('image')  # Potentially updated image path
+
+         if current_image_path and not page_sizes_df.empty:
+             try:
+                 replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
+                     doc_full_file_name_textbox, current_image_path, page_sizes_df,
+                     page_num_reported, input_folder=input_folder  # Use the 1-based page number
+                 )
+
+                 # Update the image path in the state and review_df for the current page
+                 if len(all_image_annotations) > page_num_reported_zero_indexed:
+                     all_image_annotations[page_num_reported_zero_indexed]['image'] = replaced_image_path
+
+                 if 'page' in review_df.columns and 'image' in review_df.columns:
+                     # Ensure the review_df page column is numeric for filtering
+                     review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)
+                     review_df.loc[review_df["page"]==page_num_reported, 'image'] = replaced_image_path
+
+             except Exception as e:
+                 print(f"Error during image path replacement for page {page_num_reported}: {e}")
+     else:
+         print(f"Warning: Page index {page_num_reported_zero_indexed} out of bounds for the all_image_annotations list.")
+
+     # Save page_sizes_df back to the page_sizes list format
+     if not page_sizes_df.empty:
+         page_sizes = page_sizes_df.to_dict(orient='records')
+     else:
+         page_sizes = []  # Ensure page_sizes is a list if the DataFrame is empty
+
+     # --- OPTIMIZATION: Prepare data *only* for the current page for display ---
+     current_page_image_annotator_object = None
+     if len(all_image_annotations) > page_num_reported_zero_indexed:
+         page_data_for_display = all_image_annotations[page_num_reported_zero_indexed]
+
+         # Convert the current page's annotations list to a DataFrame for coordinate
+         # multiplication, needed for display because the state stores relative coordinates
+         current_page_annotations_df = convert_annotation_data_to_dataframe([page_data_for_display])
+
+         if not current_page_annotations_df.empty and not page_sizes_df.empty:
+             # Multiply coordinates *only* for this page's DataFrame
+             try:
+                 # The specific page's size is needed for the multiplication
+                 page_size_row = page_sizes_df[page_sizes_df['page'] == page_num_reported]
+                 if not page_size_row.empty:
+                     current_page_annotations_df = multiply_coordinates_by_page_sizes(
+                         current_page_annotations_df, page_size_row,  # Pass only the row for the current page
+                         xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
+                     )
+
+             except Exception as e:
+                 print(f"Warning: Error during coordinate multiplication for page {page_num_reported}: {e}. Using original coordinates.")
+
+         if "color" not in current_page_annotations_df.columns:
+             current_page_annotations_df['color'] = '(0, 0, 0)'
+
+         # Convert the processed DataFrame back to the list-of-dicts format for the annotator
+         processed_current_page_annotations_list = current_page_annotations_df[["xmin", "xmax", "ymin", "ymax", "label", "color", "text", "id"]].to_dict(orient='records')
+
+         # Construct the final object expected by the Gradio image_annotator value parameter
+         current_page_image_annotator_object: AnnotatedImageData = {
+             'image': page_data_for_display.get('image'),  # Use the (potentially updated) image path
+             'boxes': processed_current_page_annotations_list
+         }
+
+     # --- Update dropdowns and the review DataFrame ---
+     # This external function still operates on potentially large DataFrames;
+     # it receives all_image_annotations and a copy of review_df.
+     try:
+         recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_modified, recogniser_entities_dropdown_value, text_entities_drop, page_entities_drop = update_recogniser_dataframes(
+             all_image_annotations,  # Pass the updated full state
+             recogniser_dataframe_base,
+             recogniser_entities_dropdown_value,
+             text_dropdown_value,
+             page_dropdown_value,
+             review_df.copy(),  # Keep the copy, as in the original function call
+             page_sizes  # Pass the updated page sizes
+         )
+         # Generate default black colours for the labels, as expected by image_annotator
+         recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
+
+     except Exception as e:
+         print(f"Error calling update_recogniser_dataframes: {e}. Returning empty/default filter data.")
+         recogniser_entities_list = []
+         recogniser_colour_list = []
+         recogniser_dataframe_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
+         recogniser_dataframe_modified = pd.DataFrame(columns=["page", "label", "text", "id"])
+         text_entities_drop = []
+         page_entities_drop = []
+
+     # --- Final output components ---
+     page_number_reported_gradio_comp = gr.Number(label="Current page", value=page_num_reported, precision=0)
+
+     ### Present image_annotator outputs
+     # Handle the case where current_page_image_annotator_object could not be prepared
+     if current_page_image_annotator_object is None:
+         # This should already be covered by the initial empty check for
+         # all_image_annotations, but acts as a safeguard:
+         print("Warning: Could not prepare annotator object for the current page.")
+         out_image_annotator = image_annotator(value=None, interactive=False)  # Present a blank, non-interactive annotator
      else:
          out_image_annotator = image_annotator(
              value = current_page_image_annotator_object,
              boxes_alpha=0.1,
              box_thickness=1,
+             label_list=recogniser_entities_list,  # Use the labels from update_recogniser_dataframes
              label_colors=recogniser_colour_list,
              show_label=False,
              height=zoom_str,

              show_share_button=False,
              show_remove_button=False,
              handles_cursor=True,
+             interactive=True  # Keep interactive when data is present
          )

+     # The original code returned the reported page number twice — once as the Gradio
+     # component and once as the plain integer value — so match that output signature.
+     return (out_image_annotator,
+             page_number_reported_gradio_comp,
+             page_number_reported_gradio_comp,  # Redundant, but matches the original return signature
+             page_num_reported,  # Plain integer value
+             recogniser_entities_dropdown_value,
+             recogniser_dataframe_out_gr,
+             recogniser_dataframe_modified,
+             text_entities_drop,  # List of text entities for the dropdown
+             page_entities_drop,  # List of page numbers for the dropdown
+             page_sizes,  # Updated page_sizes list
+             all_image_annotations)  # Return the updated full state

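For reference, the `value` handed to `image_annotator` above is a plain dict: an `image` path plus a `boxes` list of absolute pixel coordinates. A stand-alone sketch (the component comes from the `gradio_image_annotation` package this app uses; the file path, label and box values are made up, and the exact box attribute set accepted may vary by component version):

```python
from gradio_image_annotation import image_annotator

example_value = {
    "image": "input/example_doc_1.png",
    "boxes": [{"xmin": 100, "ymin": 280, "xmax": 300, "ymax": 350,
               "label": "ADDRESS", "color": (0, 0, 0), "text": "1 Example Road"}],
}
annotator = image_annotator(value=example_value, boxes_alpha=0.1, box_thickness=1)
```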
  def update_all_page_annotation_object_based_on_previous_page(
          page_image_annotator_object:AnnotatedImageData,

      previous_page_zero_index = previous_page -1

      if not current_page: current_page = 1
+
+     # This replaces the numpy array image object with the image file path
+     page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)

      if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
      else: all_image_annotations[previous_page_zero_index]["boxes"] = []

      page_image_annotator_object = all_image_annotations[current_page - 1]

      # This replaces the numpy array image object with the image file path
+     page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, current_page)
      page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]

      if not page_image_annotator_object:

      # Check if all elements are integers in the range 0-255
      if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
          pass
+
      else:
          print(f"Invalid color values: {fill}. Defaulting to black.")
          fill = (0, 0, 0) # Default to black if invalid

      doc = [image]

  elif file_extension in '.csv':
      pdf_doc = []

  # If working with pdfs

      row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})

+     return row_value_df

  def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):

      row_value_job_id = evt.row_value[0] # This is the job ID value
      # row_value_label = evt.row_value[1] # This is the label value

      return row_value_page, row_value_df

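Callbacks like `df_select_callback_textract_api` are wired to a Gradio `Dataframe.select` event, which passes the clicked row in via `gr.SelectData`. A minimal sketch of that wiring, assuming a recent Gradio version where `evt.row_value` carries the whole selected row (as the code above relies on):

```python
import gradio as gr
import pandas as pd

def on_select(df: pd.DataFrame, evt: gr.SelectData):
    # evt.row_value is the full clicked row; index 0 is the first column
    return str(evt.row_value[0])

with gr.Blocks() as demo:
    table = gr.Dataframe(pd.DataFrame({"job_id": ["abc123"], "status": ["done"]}))
    picked = gr.Textbox(label="Selected job")
    table.select(on_select, inputs=[table], outputs=[picked])
```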
+ def update_selected_review_df_row_colour(
+         redaction_row_selection: pd.DataFrame,
+         review_df: pd.DataFrame,
+         previous_id: str = "",
+         previous_colour: str = '(0, 0, 0)',
+         colour: str = '(1, 0, 255)'
+ ) -> tuple[pd.DataFrame, str, str]:
      '''
      Update the colour of a single redaction box based on the values in a selection row
+     (optimised version).
      '''

+     # Ensure the 'color' column exists; default to previous_colour if previous_id is provided
+     if "color" not in review_df.columns:
+         review_df["color"] = previous_colour if previous_id else '(0, 0, 0)'
+
+     # Ensure the 'id' column exists
      if "id" not in review_df.columns:
+         # fill_missing_ids returns a DataFrame with generated ids; it can be slow,
+         # so it is more efficient to handle this outside the function where possible
+         print("Warning: 'id' column not found. Calling fill_missing_ids.")
+         review_df = fill_missing_ids(review_df)
+
+     # --- Optimisation 1 & 2: Reset existing highlight colours with vectorised assignment ---
+     # Reset the colour of the previously highlighted row
+     if previous_id and previous_id in review_df["id"].values:
+         review_df.loc[review_df["id"] == previous_id, "color"] = previous_colour
+
+     # Reset any row that currently has the highlight colour, which handles cases where
+     # previous_id was not tracked correctly. The colour values are assumed to be
+     # consistently stored as strings of the form '(R, G, B)'.
+     review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'

      if not redaction_row_selection.empty and not review_df.empty:
          use_id = (
+             "id" in redaction_row_selection.columns
+             and "id" in review_df.columns
+             and not redaction_row_selection["id"].isnull().all()
              and not review_df["id"].isnull().all()
          )

+         selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]

+         # --- Optimisation 3: Use an inner merge directly ---
+         # Merge to find the rows in review_df that match redaction_row_selection;
+         # an inner join suffices because only the matches are of interest
+         merged_reviews = review_df.merge(
+             redaction_row_selection[selected_merge_cols],
+             on=selected_merge_cols,
+             how="inner"
+         )

+         if not merged_reviews.empty:
+             # Only one match is expected when highlighting a single row; if multiple
+             # matches are possible and all should be highlighted, the logic for
+             # previous_id and previous_colour needs adjustment
+             new_previous_colour = str(merged_reviews["color"].iloc[0])
+             new_previous_id = merged_reviews["id"].iloc[0]
+
+             # --- Optimisation 1 & 2: Update the matched row's colour with vectorised assignment ---
+             if use_id:
+                 # Faster update when the unique 'id' is the merge key
+                 review_df.loc[review_df["id"].isin(merged_reviews["id"]), "color"] = colour
+             else:
+                 # More general case using multiple columns - may be slower.
+                 # Create a temporary key for the comparison
+                 def create_merge_key(df, cols):
+                     return df[cols].astype(str).agg('_'.join, axis=1)
+
+                 review_df_key = create_merge_key(review_df, selected_merge_cols)
+                 merged_reviews_key = create_merge_key(merged_reviews, selected_merge_cols)
+
+                 review_df.loc[review_df_key.isin(merged_reviews_key), "color"] = colour
+
+             previous_colour = new_previous_colour
+             previous_id = new_previous_id
+         else:
+             # No rows matched the selection. The reset logic at the top has already
+             # returned any highlighted rows to (0, 0, 0), so only the tracking
+             # variables need resetting here.
+             print("No reviews found matching selection criteria")
+             previous_colour = '(0, 0, 0)'
+             previous_id = ''

      else:
+         # If the selection is empty, reset any existing highlights
+         review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'
+         previous_colour = '(0, 0, 0)'
+         previous_id = ''

+     # Maintain the expected column order; note that this re-selection may copy data
+     if set(["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"]).issubset(review_df.columns):
+         review_df = review_df[["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"]]
+     else:
+         print("Warning: Not all expected columns are present in review_df for reordering.")

      return review_df, previous_id, previous_colour

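Stripped of the merge bookkeeping, the highlight logic above reduces to two vectorised `.loc` assignments rather than a row-by-row loop: clear any row carrying the highlight colour, then paint the newly selected row. A condensed sketch with hypothetical data:

```python
import pandas as pd

HIGHLIGHT = '(1, 0, 255)'
DEFAULT = '(0, 0, 0)'

def highlight_row(df: pd.DataFrame, selected_id: str) -> pd.DataFrame:
    df.loc[df["color"] == HIGHLIGHT, "color"] = DEFAULT   # clear the old highlight
    df.loc[df["id"] == selected_id, "color"] = HIGHLIGHT  # paint the new one
    return df

df = pd.DataFrame({"id": ["a", "b", "c"], "color": [DEFAULT, HIGHLIGHT, DEFAULT]})
print(highlight_row(df, "c"))  # 'b' reverts to default, 'c' is highlighted
```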
 
1285
  page_sizes_df = pd.DataFrame(page_sizes)
1286
 
1287
  # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export
 
 
1288
  pages_are_images = False
1289
 
1290
  if "mediabox_width" not in review_file_df.columns:
 
1336
  raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
1337
  else:
1338
  print("Document cropboxes not found.")
 
1339
 
1340
  pdf_page_height = pymupdf_page.mediabox.height
1341
+ pdf_page_width = pymupdf_page.mediabox.width
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1342
 
1343
  # Create redaction annotation
1344
  redact_annot = SubElement(annots, 'redact')
 
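The Adobe export around this hunk builds an XFDF tree with `xml.etree.ElementTree`: each redaction becomes a `<redact>` element whose rectangle is expressed in PDF points relative to the page mediabox. A minimal sketch of that structure (the attribute set here is illustrative, not the exact one the app writes):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

xfdf = Element('xfdf', xmlns="http://ns.adobe.com/xfdf/")
annots = SubElement(xfdf, 'annots')
redact = SubElement(annots, 'redact')
redact.set('page', '0')                        # zero-based page index
redact.set('rect', '100.0,600.0,300.0,630.0')  # x1,y1,x2,y2 in PDF points
print(tostring(xfdf, encoding='unicode'))
```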
      # Optionally, you can add the image path or other relevant information
      df.loc[_, 'image'] = image_path

      out_file_path = output_folder + file_path_name + "_review_file.csv"
      df.to_csv(out_file_path, index=None)
1621