# Medical_Lab_Test_Extraction_Pipeline / entity_recognition.py
import json

from google import genai
from google.genai import types

from config import google_api
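def process_text(extracted_text):
    """Send OCR text from a lab report to Gemini and return the structured result as a dict.

    Returns None if the model response is not valid JSON.
    """
    client = genai.Client(
        api_key=google_api,
    )
    model = "gemini-2.0-flash"
    # Few-shot conversation: an instruction with a sample report, the expected
    # answer for that sample, then the text to process.
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""**Instruction:**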
You are an advanced AI model specializing in medical data extraction. Given an unstructured OCR-extracted text from a medical lab report, your task is to:
1. **Correct Errors**
- Fix missing decimal points, misspelled test names, and garbled reference ranges.
- Ensure corrected values and reference ranges are medically plausible; never alter a value just to make it fall within range.
2. **Extract and Structure Data**
- Extract **metadata** (patient details) and **lab report data** in structured JSON format.
- Maintain consistency in naming conventions and JSON structure.
3. **Assign Status Labels**
- **GREEN**: Value is within the normal range.
- **AMBER**: Borderline or slightly out of range.
- **RED**: Critical or significantly out of range.
### **JSON Output Format (Strictly Follow This Structure)**
```json
{
\"metadata\": {
\"patient_name\": \"<Corrected Name>\",
\"age\": \"<Age>\",
\"gender\": \"<Male/Female/Other>\",
\"lab_name\": \"<Lab Name>\",
\"report_date\": \"<DD-MM-YYYY>\"
},
\"report\": [
{
\"test_type\": \"<HEMOGRAM / BIOCHEMISTRY / OTHER>\",
\"lab_tests\": [
{
\"test_name\": \"<Corrected Test Name>\",
\"value\": \"<Numerical Value>\",
\"unit\": \"<Unit>\",
\"reference_range\": \"<Lower Limit - Upper Limit Unit>\",
\"status\": \"<GREEN / AMBER / RED>\"
}
]
}
]
}
```
### **Extracted Text:**
Dr. Onkar Test Sanjeevan Iospital MNES Mn) No:Tiz 12/4 Paud Racid Kothrud Fune - 4V102 Ph: 02025262+5,8983390126, Tlmins: 09.15 AM 0z.30 PMOS.30PM OY_OPAAPPOINTMENTS ONLY | Closed: Mondjy Fridwy Ftent UID: 67 Report No: UOOI8 Nane: AMAF SHAHA (Mle) DIc 02-lul-20 73e 40 years Sample CollectedAc HoqitLb Mddress; MG Rozd FUNE Simple Type/Quantly: Blood Ref. By Doctor Sumnple Collexulon DT: 2-Jul-20, 950AV Dc . Amlt Dcshmukh Tesl Fesult DT: 0)-Jul-zo. +53PI HEMOGRAM IvesTGATiUR RESULT UNM REF. MINGE HaLMOGiOMN 14 guts/a 120.170 KRC coint 4 nlfcunm 41 51 HAEMOTOCRIT (PcW) 30 9u 320.470 MCV 78 n 760.i00, Mch H 32 6 200.320 McHc 32 A 315.365 Rdw 13 9 ; 116.I50 MPv 112 6A . 126 WKC CNT 4567 IcuMIM ACOO Tiuoo DIFTERENTML CnT MeuTKOHILS 56 10 - 70 LyMPOCTTES 20 200 450 NONcytis 1 4.6 RASOPHILS 0.1 PETELETS 273+00 fcu mm I50oou -+Ouuou PEIIPHERIL SMEAIL EXAMINATION RBC Mopholegy Nonmocyulc Normaxchronulc PMTELETS Adequate RS For "'Jnsites Not Scen TEST DONE ON FULLY AUTOMATED) FWVE PART HAEMOTOLOGY ANALYSER XTIROO , SYSMEXHPENTRA XLR HORRIRIBA (DfierentaalWBC Cunt [s correlted microscoplelly) H High, L Loiv ) END OF REPORT 8 Rv Rc
### **Key Constraints:**
- **DO NOT** include extra metadata fields beyond `patient_name`, `age`, `gender`, `lab_name`, and `report_date`.
- **DO NOT** include extra test fields beyond `test_name`, `value`, `unit`, `reference_range`, and `status`.
- **ENSURE** corrected values and reference ranges are plausible against real-world medical standards.
- **FORMAT OUTPUT** strictly in JSON, ensuring proper nesting.
"""),
            ],
        ),
        types.Content(
            role="model",
            parts=[
                types.Part.from_text(text="""{
\"metadata\": {
\"patient_name\": \"Amar Shaha\",
\"age\": \"40 years\",
\"gender\": \"Male\",
\"lab_name\": \"Sanjeevan Hospital\",
\"report_date\": \"02-07-2020\"
},
\"report\": [
{
\"lab_tests\": [
{
\"reference_range\": \"12 - 17 g/dL\",
\"status\": \"GREEN\",
\"test_name\": \"Hemoglobin\",
\"unit\": \"g/dL\",
\"value\": \"14.1\"
},
{
\"reference_range\": \"4 - 6 million/cu.mm\",
\"status\": \"GREEN\",
\"test_name\": \"RBC Count\",
\"unit\": \"million/cu.mm\",
\"value\": \"4.8\"
},
{
\"reference_range\": \"37 - 52 %\",
\"status\": \"AMBER\",
\"test_name\": \"Hematocrit (PCV)\",
\"unit\": \"%\",
\"value\": \"36.9\"
},
{
\"reference_range\": \"76 - 100 fL\",
\"status\": \"GREEN\",
\"test_name\": \"MCV\",
\"unit\": \"fL\",
\"value\": \"78.1\"
},
{
\"reference_range\": \"27 - 32 pg\",
\"status\": \"RED\",
\"test_name\": \"MCH\",
\"unit\": \"pg\",
\"value\": \"32.6\"
},
{
\"reference_range\": \"31.5 - 36.5 g/dL\",
\"status\": \"GREEN\",
\"test_name\": \"MCHC\",
\"unit\": \"g/dL\",
\"value\": \"32.8\"
},
{
\"reference_range\": \"11.6 - 15 %\",
\"status\": \"GREEN\",
\"test_name\": \"RDW\",
\"unit\": \"%\",
\"value\": \"13.9\"
},
{
\"reference_range\": \"7.2 - 12.9 fL\",
\"status\": \"GREEN\",
\"test_name\": \"MPV\",
\"unit\": \"fL\",
\"value\": \"11.2\"
},
{
\"reference_range\": \"4000 - 11000 /cu.mm\",
\"status\": \"GREEN\",
\"test_name\": \"WBC Count\",
\"unit\": \"/cu.mm\",
\"value\": \"4567\"
},
{
\"reference_range\": \"150000 - 450000 /cu.mm\",
\"status\": \"GREEN\",
\"test_name\": \"Platelets\",
\"unit\": \"/cu.mm\",
\"value\": \"273000\"
}
],
\"test_type\": \"HEMOGRAM\"
},
{
\"lab_tests\": [
{
\"reference_range\": \"40 - 70 %\",
\"status\": \"GREEN\",
\"test_name\": \"Neutrophils\",
\"unit\": \"%\",
\"value\": \"56.1\"
},
{
\"reference_range\": \"20 - 45 %\",
\"status\": \"GREEN\",
\"test_name\": \"Lymphocytes\",
\"unit\": \"%\",
\"value\": \"20.2\"
},
{
\"reference_range\": \"2 - 10 %\",
\"status\": \"RED\",
\"test_name\": \"Monocytes\",
\"unit\": \"%\",
\"value\": \"14.6\"
},
{
\"reference_range\": \"0 - 1 %\",
\"status\": \"GREEN\",
\"test_name\": \"Basophils\",
\"unit\": \"%\",
\"value\": \"0.1\"
}
],
\"test_type\": \"DIFFERENTIAL COUNT\"
},
{
\"lab_tests\": [
{
\"reference_range\": \"Normal\",
\"status\": \"GREEN\",
\"test_name\": \"RBC Morphology\",
\"unit\": \"N/A\",
\"value\": \"Normocytic Normochromic\"
},
{
\"reference_range\": \"Normal\",
\"status\": \"GREEN\",
\"test_name\": \"Platelet Morphology\",
\"unit\": \"N/A\",
\"value\": \"Adequate\"
}
],
\"test_type\": \"PERIPHERAL SMEAR EXAMINATION\"
}
]
}"""),
            ],
        ),
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text=extracted_text),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        temperature=1,
        top_p=0.95,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="application/json",
        response_schema=types.Schema(
            type=types.Type.OBJECT,
            required=["metadata", "report"],
            properties={
                "metadata": types.Schema(
                    type=types.Type.OBJECT,
                    required=["patient_name", "age", "gender", "lab_name", "report_date"],
                    properties={
                        "patient_name": types.Schema(type=types.Type.STRING),
                        "age": types.Schema(type=types.Type.STRING),
                        "gender": types.Schema(
                            type=types.Type.STRING,
                            enum=["Male", "Female", "Other"],
                        ),
                        "lab_name": types.Schema(type=types.Type.STRING),
                        "report_date": types.Schema(type=types.Type.STRING),
                    },
                ),
                "report": types.Schema(
                    type=types.Type.ARRAY,
                    items=types.Schema(
                        type=types.Type.OBJECT,
                        required=["test_type", "lab_tests"],
                        properties={
                            "test_type": types.Schema(type=types.Type.STRING),
                            "lab_tests": types.Schema(
                                type=types.Type.ARRAY,
                                items=types.Schema(
                                    type=types.Type.OBJECT,
                                    required=["test_name", "value", "unit", "reference_range", "status"],
                                    properties={
                                        "test_name": types.Schema(type=types.Type.STRING),
                                        "value": types.Schema(type=types.Type.STRING),
                                        "unit": types.Schema(type=types.Type.STRING),
                                        "reference_range": types.Schema(type=types.Type.STRING),
                                        "status": types.Schema(
                                            type=types.Type.STRING,
                                            enum=["GREEN", "AMBER", "RED"],
                                        ),
                                    },
                                ),
                            ),
                        },
                    ),
                ),
            },
        ),
        system_instruction=[
            types.Part.from_text(text="""You are an advanced medical data extraction AI designed to process unstructured OCR text from medical lab reports. Your task is to correct errors in test names, values, and reference ranges while ensuring all values align with real-world medical standards. Extract metadata and lab test data in a structured JSON format, strictly following the predefined schema. Assign status labels (GREEN, AMBER, RED) based on whether test values fall within, near, or outside the reference range. Do not add extra fields or modify reference ranges unless corrections are needed for accuracy. Ensure consistent formatting, valid numerical values, and a properly structured JSON output without any deviations."""),
        ],
    )
    try:
        response = client.models.generate_content(
            model=model, contents=contents, config=generate_content_config
        )
        # response.text may be None (for example, when the response carries no text part);
        # falling back to "" makes json.loads raise JSONDecodeError instead of TypeError.
        parsed_json = json.loads(response.text or "")  # Convert JSON string to Python dictionary
        return parsed_json
    except json.JSONDecodeError:
        print("Error: Invalid JSON response from the model.")
        return None
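
The prompt asks the model itself to assign GREEN, AMBER, or RED, so a deterministic cross-check on the returned dictionary may help catch labels that disagree with the stated reference range. The following is a minimal sketch and not part of the original file: `parse_reference_range`, `expected_status`, `find_status_mismatches`, and the `amber_margin` tolerance (10% of the range width) are hypothetical names and values chosen for illustration. The source only describes AMBER as "borderline or slightly out of range", so the exact cutoff is an assumption.

```python
import re
from typing import List, Optional, Tuple

# Matches "<low> - <high>" at the start of a reference_range string,
# e.g. "12 - 17 g/dL". Negative bounds are not handled in this sketch.
_RANGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")


def parse_reference_range(reference_range: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) from a range string, or None if it is not numeric."""
    match = _RANGE_RE.match(reference_range or "")
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    return (low, high) if low <= high else None


def expected_status(value, reference_range, amber_margin: float = 0.10) -> Optional[str]:
    """Recompute the status label; None when value or range is not numeric.

    amber_margin is an assumed tolerance: a value outside the range by at
    most this fraction of the range width is treated as AMBER.
    """
    bounds = parse_reference_range(reference_range)
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if bounds is None:
        return None
    low, high = bounds
    if low <= number <= high:
        return "GREEN"
    tolerance = (high - low) * amber_margin
    if low - tolerance <= number <= high + tolerance:
        return "AMBER"
    return "RED"


def find_status_mismatches(parsed_json: dict) -> List[Tuple[str, str, str]]:
    """List (test_name, model_status, recomputed_status) where the two differ."""
    mismatches = []
    for section in parsed_json.get("report", []):
        for test in section.get("lab_tests", []):
            recomputed = expected_status(test.get("value"), test.get("reference_range"))
            if recomputed is not None and recomputed != test.get("status"):
                mismatches.append((test.get("test_name"), test.get("status"), recomputed))
    return mismatches
```

For example, `find_status_mismatches(process_text(ocr_text))` would list each test whose label differs from the recomputed one, assuming `process_text` returned a dictionary rather than `None`. Qualitative results such as "Normocytic Normochromic" are skipped because they have no numeric range to compare against.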