Why the need for scalable image de-identification
Medical images, such as X-rays and MRIs, do more than aid diagnosis, treatment planning, and disease monitoring: they are increasingly used beyond individual patient care to inform broader medical research, public health policy, and the development of new AI-powered diagnostic tools. This secondary use of medical data, while immensely valuable, requires de-identification of protected health information (PHI) to safeguard patient privacy and comply with regulations like HIPAA.
The growing scale of medical image datasets calls for reliable and efficient de-identification methods, ensuring that images can be safely and ethically used to advance medical science. To this end, we present the Pixels Solution Accelerator with a Spark ML pipeline that leverages Vision Language Models (VLMs) in parallel to de-identify medical images in the widely used Digital Imaging and Communications in Medicine (DICOM) format.
A DICOM file contains both images and metadata text (read more here). Here, we focus on our new feature for de-identifying images. It is worth noting that Pixels, our DICOM toolkit, also de-identifies metadata, alongside scalable DICOM ingestion and segmentation, all within a web application.
De-identify PHI burned into DICOM images
After installing the Pixels Python package, run the DicomPhiPipeline.
It reads a DICOM file path from a column in a Spark DataFrame and outputs two columns:
- a response from a VLM (specified via endpoint)
- a DICOM file path with the PHI masked
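A minimal invocation might look like the following sketch. The import path, endpoint name, and input location are assumptions for illustration; check the Pixels package documentation for the exact signatures.

```python
from pyspark.sql import SparkSession

# Import path is an assumption -- see the Pixels repo for the exact module.
from dbx.pixels.dicom import DicomPhiPipeline

spark = SparkSession.builder.getOrCreate()

# A DataFrame with a column of DICOM file paths (location is illustrative).
df = spark.read.format("binaryFile").load("/Volumes/main/pixels/dicom/")

pipeline = DicomPhiPipeline(
    endpoint="databricks-claude-3-7-sonnet",  # VLM serving endpoint (assumed name)
    redact_even_if_undetected=False,          # redact only when the VLM detects PHI
)

# Adds two columns: the VLM response and the path to the PHI-masked DICOM file.
result = pipeline.transform(df)
```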
As part of DicomPhiPipeline, redaction is performed using EasyOCR. Redaction can run independently of VLM PHI detection (redact_even_if_undetected=True) or conditionally on VLM PHI detection (redact_even_if_undetected=False). We recommend the latter, as EasyOCR tends to over-redact non-PHI text. By conditioning on images the VLM has flagged as PHI-positive, EasyOCR is less likely to redact non-PHI images.
Comparison with other PHI detection methods
The competitors
We compared Pixels' image PHI detection pipeline with a commercial vendor and a widely used open source solution, Presidio. Both the vendor and Presidio first use OCR to extract text from the images and then apply a language model to classify whether the text is PHI. The built-in OCR also segments sensitive text and applies a fill mask within those bounding boxes.
Additionally, we compared several VLMs: GPT-4o, Claude 3.7 Sonnet, and the open-source Llama 4 Maverick.
Datasets
The comparison was conducted on the public DICOM dataset MIDI-B, which we downsampled to 70 images to create a balanced dataset with roughly equal numbers of images with and without PHI.
Results
Task: PHI detection in DICOM images, MIDI-B dataset (70 images).

| Solution | Cost estimate per 100k images | Recall | Precision | Specificity | NPV |
|---|---|---|---|---|---|
| ISV (commercial) | $4,400 per month, prepaid | 1.0 | 0.71 | 0.93 | 1.0 |
| Presidio (OSS) | $0 | 0.7 | 0.7 | 0.95 | 0.95 |
| Claude 3.7 Sonnet | $270 | 1.0 | 1.0 | 1.0 | 1.0 |
| GPT-4o | $150 | 1.0 | 1.0 | 1.0 | 1.0 |
| Llama 4 Maverick (OSS) | $45 | 1.0 | 0.91 | 0.98 | 1.0 |
Both Claude 3.7 Sonnet and GPT-4o achieved perfect PHI detection performance on this benchmark. Llama 4 Maverick had 100% recall but 91% precision, as it sometimes mis-identifies non-PHI text in the image as PHI. Nonetheless, Llama 4 Maverick still offers strong performance, especially for users who lean toward over-redaction to avoid missing any PHI: with a zero false omission rate (NPV close to 1) and recall of 1, it can be a good balance between performance and cost.
In our tests, we used Presidio and the commercial solution out of the box with default settings. We noticed that performance, in terms of both accuracy and speed, was highly dependent on the choice of OCR. Their performance could likely be improved with alternatives such as Azure Document Intelligence.
Why it works
We surveyed the literature on de-identifying burned-in text on medical images and learned from the reported success of using OCR, LLMs (e.g., BERT, Bi-LSTM, GPT) and/or VLMs. Our decision to use a VLM to detect PHI and EasyOCR to detect text bounding boxes was guided by the success reported by Truong et al. 2025.
- VLMs replace traditional OCR, which is poor at text recognition and often introduces typos
In most reported de-identification methods, OCR is typically the first step, extracting text from images to feed into an LLM. However, we observed that OCR tools like Tesseract and EasyOCR were often poor and slow at text recognition (i.e., reading), frequently mis-reading certain characters, inadvertently introducing typos, and compromising downstream PHI detection. To mitigate this, we used a VLM to read the burned-in text and classify whether it is PHI; the VLMs were surprisingly good at this.
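In practice, the VLM step reduces to prompting the model with the image and parsing a structured verdict. The prompt wording and JSON contract below are our own illustrative assumptions, not necessarily what Pixels ships with:

```python
import json
import re

# Illustrative prompt -- the exact wording used by Pixels may differ.
PHI_PROMPT = (
    "Read any text burned into this medical image and answer with JSON "
    '{"phi": true} if it contains protected health information '
    '(patient names, birth dates, MRNs, ...), else {"phi": false}.'
)

def parse_phi_verdict(response_text: str) -> bool:
    """Parse the VLM's JSON verdict, falling back to a regex when the
    model wraps the JSON in extra prose."""
    try:
        return bool(json.loads(response_text)["phi"])
    except (ValueError, KeyError, TypeError):
        return re.search(r'"phi"\s*:\s*true', response_text, re.I) is not None
```

The response text would come from a chat-completions call to the serving endpoint with the DICOM frame attached as a base64-encoded image.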
- EasyOCR to detect bounding boxes for redaction, since VLMs cannot alter images
However, VLMs cannot output redacted images. Thus, we used OCR to do what it does best, i.e., detect text, producing the bounding box coordinates for subsequent masking. It is worth noting that although there have been recent attempts to fine-tune a VLM to output bounding box coordinates (Chen et al. 2025), we opted for a simpler solution assembling off-the-shelf tools (VLM, EasyOCR) instead.
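Mechanically, the masking step reduces to filling each detected box with a constant value. Here is a minimal sketch with NumPy; it assumes the quadrilateral corner points EasyOCR's reader.readtext() returns have already been collapsed to axis-aligned boxes, and the exact fill logic in Pixels may differ:

```python
import numpy as np

def mask_regions(image: np.ndarray, boxes, fill_value: int = 0) -> np.ndarray:
    """Fill each (x_min, y_min, x_max, y_max) box with a constant value.

    `boxes` is assumed to be derived from EasyOCR detections, e.g. by
    collapsing each detected quadrilateral to its axis-aligned bounds.
    Returns a masked copy; the input image is left untouched.
    """
    out = image.copy()
    for x_min, y_min, x_max, y_max in boxes:
        out[y_min:y_max, x_min:x_max] = fill_value
    return out
```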
- Spark parallelism for production-grade scalability
While Databricks has a batch inference capability for LLMs (ai_functions), it currently lacks support for VLMs. As such, we implemented a scalable version for the VLM and EasyOCR using Pandas UDFs. Working with a large pharmaceutical customer, Spark parallelism sped up their de-identification process from 105 minutes to 6 minutes on a trial run of 1,000 DICOM frames! Scaling up to their full workload of 100,000 DICOM frames, the speed-up and cost savings were significant.
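The shape of that implementation is a per-batch function that Spark fans out across the cluster via a Pandas UDF. The sketch below is illustrative: the column names, UDF wrapper, and the call_vlm_endpoint helper are assumptions, not the exact Pixels internals.

```python
import pandas as pd

def detect_phi_batch(paths: pd.Series, classify) -> pd.Series:
    """Process one Arrow batch of DICOM file paths. `classify` wraps the
    VLM endpoint call and is injected so the batch logic stays testable."""
    return paths.map(classify)

# On Databricks, Spark parallelizes the batches with a Pandas UDF, e.g.:
#
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf("string")
#   def detect_phi_udf(paths: pd.Series) -> pd.Series:
#       return detect_phi_batch(paths, classify=call_vlm_endpoint)
#
#   df = df.withColumn("vlm_response", detect_phi_udf("path"))
```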
Summary
Given the power, ease, and economics of VLMs as demonstrated by the Pixels 2.0 solution accelerator add-ons, it is not only feasible but prudent to protect your critical clinical studies and related imaging studies with scalable PHI detection.
While Pixels is designed for DICOM files, we have seen customers adapt it for other image formats: JPEG, Whole Slide Images, SVS, and so on.
The updates are posted to our GitHub repo, so now is a good time to update or try out the Databricks Pixels 2.0 solution accelerator. Reach out to your Databricks account team to discuss your imaging data processing and AI/ML use cases. The authors would be happy to hear from you on LinkedIn if we haven't already been introduced.
