De-Identification Software Tools for Healthcare Data: A Comparative Review

May 01, 2024

Data is the foundation of clinical research, crucial for advancing healthcare. With healthcare organizations increasingly investing in AI-powered solutions, using multimodal data for various purposes, including AI model training and analytics, has become a common practice. However, leveraging clinical data containing Protected Health Information (PHI) poses a significant challenge for organizations. 

This privacy concern slows down research progress and limits insights from analytics. Data containing PHI becomes locked away, inaccessible, and unsharable, blocking collaboration efforts and hindering potential breakthroughs. 

To overcome these challenges, healthcare organizations often need to de-identify data for secondary purposes like analysis, research, or business applications. Yet, removing PHI from unstructured data is a complex task. Traditional de-identification methods, though effective, are often burdensome, time-consuming, and expensive.

As healthcare data grows in volume and complexity, so does the demand for more efficient de-identification solutions. That’s where data de-identification software tools come in: these innovative solutions streamline processes and reduce costs for healthcare organizations. 

In this comparative review, we’ll explore six de-identification tools, assessing their effectiveness in protecting PHI.

Understanding De-Identification in Healthcare Data

De-identification is the process of removing or altering personally identifiable information (PII) to reduce the likelihood of linking individuals’ identities with specific data. De-identification for healthcare data refers to removing or altering identifiers from patient records, particularly PHI, to protect individual privacy while retaining the data’s utility for research, analysis, and other purposes. 
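To make "removing or altering identifiers" concrete, here is a minimal rule-based sketch. The patterns and bracketed labels are illustrative assumptions, not a complete PHI detector: real clinical text contains names, addresses, and record numbers that simple patterns like these will miss.

```python
import re

# Hypothetical patterns for a few identifier types. Real PHI detection
# needs far broader coverage (names, addresses, record numbers, ...).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = ("Seen on 03/14/2023. Call 555-867-5309 or email "
        "jane.doe@example.org. SSN 123-45-6789.")
print(redact(note))
# prints: Seen on [DATE]. Call [PHONE] or email [EMAIL]. SSN [SSN].
```

Replacing a value with its type label, rather than deleting it, keeps the sentence readable and tells an analyst what kind of information was removed.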

For example, in 2021, a group of healthcare providers collaborated to establish Truveta, a company dedicated to leveraging big data analytics to enhance care insights. By combining de-identified data from tens of millions of patients across thousands of care facilities in the United States, Truveta facilitated the availability of large datasets for medical research.

Data de-identification is crucial in healthcare due to legal and ethical considerations surrounding patient confidentiality. De-identified data is vital to healthcare data management because it:

  • Protects patient privacy: De-identification protects patients’ privacy by removing any identifiers that could reveal personal information without consent.
  • Facilitates research and analysis: By anonymizing patient data, de-identification allows researchers and analysts to access valuable healthcare information for studies. 
  • Supports data sharing: De-identified data can be shared more freely among researchers, healthcare providers, and other stakeholders. This promotes collaboration and innovation in healthcare.
  • Enhances regulatory compliance: De-identifying data helps organizations meet the requirements of regulations like the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR), and so avoid legal and financial consequences related to data privacy and security.

6 Software Tools to De-Identify Healthcare Data

De-identification software tools help de-identify healthcare data by anonymizing or removing PII and PHI. This ensures compliance with regulations while preserving data utility for analysis and research. 

Below are six de-identification software tools.

1. iMerit Ango Hub

iMerit Ango Hub is a purpose-built tool designed to automate the de-identification of sensitive healthcare information. Leveraging pre-trained natural language processing (NLP) models, the tool automates the detection and protection of PHI by blurring and obscuring sensitive data for privacy.

Automation with expert verification in iMerit Ango Hub

  • Automated de-identification
  • Optional human review and verification for quality assurance
  • Analytics and reporting to monitor quality and track progress
  • Simplified data sharing

2. Google Healthcare API

The Google Healthcare API, also known as the Cloud Healthcare API, detects sensitive data, such as PHI, in healthcare data formats like Digital Imaging and Communications in Medicine (DICOM) instances and Fast Healthcare Interoperability Resources (FHIR) resources. Using de-identification transformations, the Google Healthcare API masks, deletes, or otherwise obscures this data to protect privacy.

Google Healthcare API

It runs on serverless infrastructure, so it scales to large datasets, which improves operational efficiency and supports advanced research and analysis.

Pros:

  • Scalable and secure solution
  • Integrates with other Google Cloud services
  • HIPAA compliant

Cons:

  • Occasional lag
  • Complex API
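As a rough sketch of what a de-identification call looks like, the snippet below only builds the URL and JSON body for the `datasets:deidentify` REST method. It sends nothing. The field names (`destinationDataset`, `config.dicom.filterProfile`, `config.image.textRedactionMode`) follow our reading of the v1 REST reference and should be checked against the current documentation. The project, location, and dataset IDs are hypothetical.

```python
import json


def build_deidentify_request(project: str, location: str,
                             source_dataset: str, dest_dataset: str):
    """Return (url, body) for a datasets:deidentify call. Nothing is sent."""
    parent = f"projects/{project}/locations/{location}/datasets"
    url = (f"https://healthcare.googleapis.com/v1/"
           f"{parent}/{source_dataset}:deidentify")
    body = {
        # De-identified copies are written to a separate dataset.
        "destinationDataset": f"{parent}/{dest_dataset}",
        "config": {
            # Assumed profile: inspect DICOM tag contents for sensitive text.
            "dicom": {"filterProfile": "DEIDENTIFY_TAG_CONTENTS"},
            # Assumed mode: redact all text burned into the image pixels.
            "image": {"textRedactionMode": "REDACT_ALL_TEXT"},
        },
    }
    return url, body


# Hypothetical identifiers, for illustration only.
url, body = build_deidentify_request(
    "my-project", "us-central1", "raw-dataset", "deid-dataset")
print(url)
print(json.dumps(body, indent=2))
```

In practice, the request would be sent as an authenticated POST, and the service would return a long-running operation to poll for completion.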

3. Amazon Comprehend Medical

Amazon Comprehend Medical is an NLP service tailored for medical text analysis, offering robust capabilities for de-identification. By analyzing unstructured clinical notes, summaries, case notes, and test results, it swiftly detects and extracts valuable medical information while identifying PHI through its advanced NLP features.

Amazon Comprehend Medical De-identification Architecture

Comprehend Medical’s HIPAA-eligible capabilities ensure accurate recognition of medically sensitive data, enabling the discovery of clinical patterns and trends within the text.

Pros:

  • Flexible and scalable solution
  • Integrates with other AWS tools and services
  • Accurate identification of medical information

Cons:

  • Less intuitive user interface (UI)
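The sketch below shows one way to turn PHI detections into redacted text. It assumes the `boto3` `comprehendmedical` client and its `detect_phi` operation, which returns entities with character offsets, a type, and a confidence score. The region, the `min_score` threshold, and the sample response are illustrative assumptions. The demo uses a hand-written response, so it runs without AWS credentials.

```python
def detect_phi_entities(text: str, region: str = "us-east-1") -> list:
    """Call the DetectPHI operation via boto3 (needs AWS credentials)."""
    import boto3  # imported lazily so the offline demo below runs without it
    client = boto3.client("comprehendmedical", region_name=region)
    return client.detect_phi(Text=text)["Entities"]


def redact_spans(text: str, entities: list, min_score: float = 0.5) -> str:
    """Replace each detected span with its entity type, e.g. [NAME]."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if ent["Score"] >= min_score:
            start, end = ent["BeginOffset"], ent["EndOffset"]
            text = text[:start] + f"[{ent['Type']}]" + text[end:]
    return text


# Offline demo with a hand-written response in the DetectPHI shape.
note = "Patient John Smith was admitted on 2023-03-14."
sample_entities = [
    {"BeginOffset": 8, "EndOffset": 18, "Score": 0.99,
     "Type": "NAME", "Text": "John Smith"},
    {"BeginOffset": 35, "EndOffset": 45, "Score": 0.98,
     "Type": "DATE", "Text": "2023-03-14"},
]
print(redact_spans(note, sample_entities))
# prints: Patient [NAME] was admitted on [DATE].
```

The service only detects PHI. Deciding how to mask, replace, or drop each span is left to the calling application, which is why the redaction step is a separate function here.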

4. IBM InfoSphere Optim

IBM InfoSphere Optim is a comprehensive solution that masks complex data and anonymizes PII such as names, addresses, and medical records to uphold patient privacy. It can de-identify vast volumes of data by concealing confidential information through masking and pseudonymization techniques.

IBM InfoSphere Optim Data Masking

IBM InfoSphere Optim can mask sensitive data across nonproduction environments, including development, testing, or training settings.

Pros:

  • Easy access, connectivity, and data masking
  • Flexibility and precision in data management
  • Data masking techniques are available, including format-preserving encryption, substitution, and shuffling

Cons:

  • Complex UI
  • Initial learning curve
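Two of the masking techniques named above, substitution and shuffling, can be illustrated in a few lines. This is a generic sketch, not IBM's implementation. The name pool, sample rows, and seed are made up, and the unkeyed hash used to pick a substitute is for illustration only.

```python
import hashlib
import random

# Hypothetical pool of stand-in names.
FAKE_NAMES = ["Alex Reed", "Sam Ortiz", "Jo Lindqvist", "Kim Darrow"]


def substitute(value: str, pool: list) -> str:
    """Substitution: map a real value to a stand-in, the same one each time.

    An unkeyed hash is used only to keep the sketch short. On its own it
    is not a security control, and two values may share a stand-in.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


def shuffle_column(values: list, seed: int) -> list:
    """Shuffling: keep the same values but break their link to each row."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


rows = [
    {"name": "Maria Gomez", "zip": "02139"},
    {"name": "Li Wei", "zip": "94103"},
    {"name": "Omar Haddad", "zip": "60614"},
]
masked_zips = shuffle_column([r["zip"] for r in rows], seed=7)
masked = [
    {"name": substitute(r["name"], FAKE_NAMES), "zip": z}
    for r, z in zip(rows, masked_zips)
]
print(masked)
```

Shuffling preserves the column's overall distribution, which is useful for test data. Substitution keeps values realistic and consistent across tables.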


5. Anonos Data Embassy

The Anonos Data Embassy software platform uses a combination of de-identification techniques to uphold data privacy and security while facilitating expanded data flow and access. Integrating ten de-identification techniques, the Data Embassy platform transforms source data into Variant Twins (protected outputs) that minimize identifying information while retaining analytical value.

Anonos Data Embassy Platform

Anonos offers statutory pseudonymization within its suite of data protection technologies, enabling organizations to unlock the potential and value of sensitive assets while mitigating risks.

Pros:

  • AI-enabled data protection and accuracy
  • Reduced data access time
  • Cloud or on-premise deployment
  • Secure and compliant healthcare data sharing

Cons:

  • No documentation is available for training
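Pseudonymization in general replaces an identifier with a token that cannot be reversed without separately held information. The sketch below uses a keyed hash (HMAC-SHA256) as one common way to do this. It is not a description of Anonos's method. The key, the record number format, and the token length are illustrative assumptions.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable token from an identifier using a keyed hash.

    The same identifier and key always give the same token, so records
    can still be linked. Without the key, tokens cannot be recomputed
    from guessed identifiers.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key; in practice it is stored apart from the data.
key = b"example-key-kept-separately"
token_a = pseudonymize("MRN-0012345", key)
token_b = pseudonymize("MRN-0012345", key)
print(token_a == token_b)  # prints: True
```

Keeping the key separate from the pseudonymized dataset is what distinguishes this from a plain hash, which anyone could recompute from a list of candidate identifiers.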

6. Private AI

Private AI offers a comprehensive de-identification solution designed to accurately identify, anonymize, and replace more than 50 types of PII entities. This enables organizations to safeguard data, extract valuable insights, and ensure compliance with global privacy regulations such as GDPR, the California Privacy Rights Act (CPRA), and HIPAA.

Private AI architecture

With deployment options including on-premise and support for a wide range of file types, including text, PDFs, images, and audio, Private AI empowers healthcare organizations to protect sensitive information across various data formats.

Pros:

  • No third-party access
  • Can process 70,000 words per second
  • Multilingual support for up to 52 languages
  • Less than half the error rate compared to alternatives

Cons:

  • Expensive compared to alternatives
  • Steep learning curve

Comparative Analysis of De-Identification Software Tools

Below is a comparative analysis of various de-identification software tools, highlighting supported data types, de-identification techniques, and overall ratings.

A Hybrid Approach to Data De-Identification 

For secure and compliant healthcare data management, iMerit offers a versatile de-identification-as-a-service solution, integrating advanced AI capabilities with human oversight. Leveraging NLP-based PHI de-identification, the automated workflow efficiently identifies and redacts sensitive information from various documents, ensuring compliance with regulations such as HIPAA and GDPR. 

Moreover, iMerit provides options to add a verification layer through Human in the Loop (HiTL), allowing healthcare data specialists to rectify any misidentifications and ensure complete anonymity of all entities. This hybrid approach combines the efficiency of automation with the nuance of human expertise, offering flexibility in adjusting the level of automation and oversight based on specific requirements. 
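One simple way to think about adjusting the balance between automation and oversight is a confidence threshold: detections the model is sure about are redacted automatically, and the rest go to a reviewer. The sketch below is a generic illustration, not iMerit's implementation. The `Detection` class, the `route` function, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    text: str
    label: str
    score: float  # model confidence in [0, 1]


def route(detections: list, auto_threshold: float = 0.9):
    """Split detections into auto-redacted and human-review queues."""
    auto, review = [], []
    for det in detections:
        if det.score >= auto_threshold:
            auto.append(det)
        else:
            review.append(det)
    return auto, review


detections = [
    Detection("John Smith", "NAME", 0.99),
    Detection("Ward 7", "LOCATION", 0.62),
    Detection("2023-03-14", "DATE", 0.95),
]
auto, review = route(detections)
print(len(auto), len(review))  # prints: 2 1
```

Raising the threshold sends more items to human reviewers and lowers the risk of missed or wrong redactions. Lowering it favors throughput.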

Additionally, iMerit’s Ango Hub, with its AI-assisted features, streamlines data labeling processes. This ensures high-quality annotations for training AI models while optimizing workflow efficiency.

Final Thoughts

Data de-identification in healthcare helps mitigate risk exposure and secure individuals’ privacy. When data is de-identified effectively, organizations may not be required to report breaches or leaks of that data, which minimizes potential liabilities. De-identification also facilitates data reuse, enabling secure data licensing arrangements.

For instance, pharmaceutical companies can leverage de-identified patient data under HIPAA to conduct insightful analyses on trends and prescription patterns. This contributes to the validation of drug efficacy and the identification of market opportunities.

However, choosing the right tool for de-identification is crucial to ensuring healthcare data privacy and regulatory compliance. While technology helps automate processes and enhance efficiency, human oversight remains essential for addressing nuances and ensuring accuracy. 

iMerit offers comprehensive solutions that integrate advanced AI capabilities with human expertise, providing a hybrid approach. 

To de-identify PHI, try the iMerit Ango Hub today!
