How Leading Healthcare Providers Are De-Identifying Data for Research

June 11, 2024

Today, healthcare organizations are embracing digital transformation to improve service delivery and the effectiveness of healthcare processes. For instance, 96% of acute care hospitals and 86% of office-based physicians in the USA currently use Electronic Health Records (EHRs). These records contain both structured and unstructured data, such as billing details, discharge summaries, and clinical notes.

Unfortunately, this raises security concerns. As vast amounts of confidential data are stored electronically, traditional protection methods are proving inadequate. This necessitates the development of more advanced security measures. Data de-identification can help in this regard. De-identified patient information in compliance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule enables significant medical research that benefits society at large. 

This article covers how leading healthcare providers de-identify data for research, along with some real-world applications.

What is De-Identification? 

Data de-identification removes or alters personal information in datasets so that individuals cannot be identified. This technique is essential in healthcare research, where patient privacy must be protected while still allowing valuable data to be used for scientific advancement.

There are several ways to de-identify data. The HIPAA Privacy Rule recognizes two methods:


  • Safe Harbor Method: This approach involves removing 18 types of identifiers from healthcare data. These include names, addresses, phone and fax numbers, medical record numbers, and Social Security numbers, as well as email addresses, IP addresses, and full-face photographs.
  • Expert Determination Method: This method employs statistical or scientific principles to ensure minimal re-identification risk. An expert in the field evaluates the data and applies principles to assess the likelihood of identifying an individual. The expert must also document the methodology and results to support the conclusion that the risk of re-identification is minimal.
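
To make the Safe Harbor idea concrete, here is a minimal Python sketch that masks a handful of pattern-based identifiers (email addresses, Social Security numbers, phone numbers, and IP addresses). The regular expressions and the `redact` function are illustrative assumptions, not a validated rule set, and they cover only a small part of the 18 identifier types. Names and addresses, for example, cannot be reliably caught this way.

```python
import re

# Illustrative patterns for a few Safe Harbor identifier types.
# These are assumptions for this sketch, not a complete or validated rule set.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = (
    "Pt reachable at 415-555-0100 or jane.doe@example.org; "
    "SSN 123-45-6789; portal login from 192.168.0.1."
)
print(redact(note))
```

A production system would need broader coverage and careful validation against real notes before anyone could treat its output as de-identified.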

Some other methods of de-identification include:

  • Anonymization: This method permanently removes personal identifiers from a dataset so individuals cannot be identified directly or indirectly. This technique ensures the data cannot be traced back to any individual, making re-identification impossible. 
  • Pseudonymization: This technique replaces private identifiers with fake identifiers or pseudonyms. While the data is still linkable to the same individual through these pseudonyms, pseudonymization prevents direct identification.
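
As a rough illustration of pseudonymization, the sketch below derives a stable pseudonym from a patient identifier using a keyed hash (HMAC-SHA256) from Python's standard library. The function name, the `PT-` prefix, and the 16-character truncation are assumptions made for this example. The key point is that the same identifier always maps to the same pseudonym, so records stay linkable, while anyone without the secret key cannot easily reverse or recompute the mapping.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash.

    The "PT-" prefix and 16-character length are arbitrary choices for this sketch.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PT-" + digest[:16]


# Hypothetical key; in practice it would be stored separately from the data.
key = b"example-secret-key"
print(pseudonymize("MRN-0012345", key))
print(pseudonymize("MRN-0012345", key))  # same pseudonym as the line above
```

Because whoever holds the key can re-create the mapping, pseudonymized data is generally still treated as sensitive.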

8 Modes of De-Identifying Reports With Automatic De-Identification Systems

An automatic de-identification system functions as a black box for a clinical data scientist: it accepts clinical data and generates de-identified data. To achieve the best results, data scientists must understand the different operational modes available within the de-identification system. These modes outline how users can operate the system and use its specific functionalities.

Here are the eight modes:

1. Repository-wide Batch De-Identification

This mode is the default operation mode most institutions use for their existing systems. It involves processing large datasets stored in repositories to remove personal identifiers in bulk. This ensures data is readily available when researchers request it without additional overhead.  

2. On-Demand Cohort-Specific De-Identification

In this mode, data for a specific cohort or group of patients is de-identified on demand when scientists request it. The data remains protected until needed. Although this introduces a delay at request time, modern systems are fast enough to make automatic de-identification almost instantaneous.
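
The difference from batch processing can be sketched in a few lines of Python. In this hypothetical example, `repository` is an in-memory dictionary of notes keyed by patient ID, and `scrub` is a stand-in scrubber that only masks medical record numbers. Both are assumptions made for illustration. The generator de-identifies a note only when it is requested, and only for patients in the cohort.

```python
import re
from typing import Callable, Dict, Iterable, Iterator

# Stand-in scrubber for this sketch: masks strings such as "MRN: 12345".
MRN = re.compile(r"\bMRN[: ]*\d+\b")


def scrub(text: str) -> str:
    return MRN.sub("[MRN]", text)


def deidentify_cohort(
    repository: Dict[str, str],
    cohort_ids: Iterable[str],
    scrubber: Callable[[str], str] = scrub,
) -> Iterator[str]:
    """Lazily yield de-identified notes for the requested cohort only."""
    for patient_id in cohort_ids:
        if patient_id in repository:
            # The patient ID itself is not returned, only the scrubbed text.
            yield scrubber(repository[patient_id])


repository = {
    "p1": "MRN: 12345 admitted with chest pain.",
    "p2": "Follow-up for MRN 777, stable.",
    "p3": "MRN: 999 discharged.",
}
for note in deidentify_cohort(repository, ["p1", "p3"]):
    print(note)
```

Notes outside the cohort (here, `p2`) are never processed, which is the trade-off this mode makes against repository-wide batch runs.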

3. On-Demand De-Identification of Query Results

On-demand de-identification of query results involves embedding the de-identification system within the EHR system. This allows query results to be de-identified in real time before being shown to researchers. This method is likely faster than the cohort-specific approach, as it eliminates waiting for a data manager to de-identify and deliver the data.

4. De-Identification with Patient and Provider Identifiers

In this mode, any details that could identify patients or doctors are removed when medical data is shared for research. It ensures that neither the patients nor the healthcare providers can be identified, enhancing overall privacy protection. Personal identifiers can be provided to the de-identification system in four ways: report-specific, cohort-specific, repository-wide, or a combination of these methods.
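
When known identifiers are supplied to the system, the simplest possible treatment is a direct search and replace. The Python sketch below assumes a report-specific list of patient and provider names. The function name and the `[REDACTED]` placeholder are hypothetical. Longer terms are replaced first so that a full name is handled before its shorter parts.

```python
import re
from typing import Iterable


def remove_known_identifiers(
    text: str, identifiers: Iterable[str], placeholder: str = "[REDACTED]"
) -> str:
    """Replace each supplied identifier (case-insensitive, whole words) with a placeholder."""
    # Longest first, so "Jane Doe" is replaced before the shorter "Doe".
    for term in sorted(set(identifiers), key=len, reverse=True):
        if not term:
            continue  # skip empty strings, which would match everywhere
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text = pattern.sub(placeholder, text)
    return text


report = "Jane Doe was seen by Dr. Alan Smith. DOE tolerated the procedure."
print(remove_known_identifiers(report, ["Jane Doe", "Doe", "Alan Smith"]))
```

Exact matching like this misses misspellings and nicknames, so it would normally complement pattern-based or model-based detection rather than replace it.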

5. Scientists-Involved De-Identification

This approach involves scientists collaborating with the de-identification system to review and refine its output. Involving scientists directly allows the system’s sensitivity to identifying information to be manually increased. However, this increased sensitivity may cause some non-sensitive information to be mistakenly flagged as sensitive. Scientists can address this by reviewing the initial de-identified results and marking the misidentified terms.
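
One way to picture this review loop is with an allowlist that scientists build up from their corrections. The Python sketch below is a simplified assumption of how that might work: `flagged` holds terms an aggressive system treats as sensitive, and `allowlist` holds terms a reviewer has marked as false positives (here, a disease name that looks like a surname).

```python
from typing import AbstractSet, List, Sequence


def redact_tokens(
    tokens: Sequence[str],
    flagged: AbstractSet[str],
    allowlist: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Mask flagged tokens unless a reviewer has allowlisted them."""
    result = []
    for token in tokens:
        key = token.lower()
        if key in flagged and key not in allowlist:
            result.append("[PHI]")
        else:
            result.append(token)
    return result


tokens = "Smith has Parkinson disease".split()
flagged = {"smith", "parkinson"}

# First pass: the sensitive setting masks the disease name as well.
print(redact_tokens(tokens, flagged))

# After review, the scientist marks "parkinson" as a false positive.
print(redact_tokens(tokens, flagged, allowlist={"parkinson"}))
```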

6. Patient-Involved De-Identification

In this mode, patients take part in the de-identification process, which allows them to consent to the removal of their identifiers. However, this mode is hypothetical and not yet in practice, as no existing systems allow patients to annotate their records for de-identification. Patients may demand greater transparency and self-verification in the coming years.

7. Physician-Involved De-Identification

Similar to the patient-involved approach, this method involves physicians in the de-identification process. Physicians may sometimes need to reference a patient’s full name and medical record number to connect records. However, this is generally discouraged because it increases the risk of privacy breaches and unauthorized access to sensitive information. With the physician-involved de-identification mode, the system alerts physicians when they include patient identifiers.
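
A simple version of such an alert could look like the Python sketch below. It checks a draft note for the patient's full name and for something shaped like a medical record number. The MRN format (the letters "MRN" followed by 6 to 10 digits) and the function name are assumptions for this example, since real formats vary by institution.

```python
import re
from typing import List

# Assumed MRN shape for this sketch: "MRN", optional ":" or "#", then 6-10 digits.
MRN_PATTERN = re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE)


def find_identifier_alerts(draft: str, patient_full_name: str) -> List[str]:
    """Return the kinds of identifiers found in a draft note."""
    alerts = []
    if patient_full_name and patient_full_name.lower() in draft.lower():
        alerts.append("patient full name")
    if MRN_PATTERN.search(draft):
        alerts.append("medical record number")
    return alerts


draft = "Discussed results with Maria Lopez (MRN 00123456) today."
for alert in find_identifier_alerts(draft, "Maria Lopez"):
    print(f"Warning: this note contains a {alert}.")
```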

8. Online De-Identification by Honest Brokers

As large health databases become more available to researchers, major centers like state cancer registries and government research facilities will likely store and manage them. Smaller institutes can access de-identified data from these larger databases through online de-identification. Acting as honest brokers, these centers remove identifying details from the data, establish usage agreements, and ensure compliance with regulations.

Real World Applications of De-Identification in Healthcare Research

De-identification is crucial in various aspects of healthcare research. It enables researchers to access and analyze data while protecting patient privacy. Some real-world case studies are given below:

1. Automated De-Identification of Large Real-World Clinical Text Datasets

In 2023, researchers tested an advanced solution for automating the de-identification of large datasets, covering over one billion clinical notes. By combining rule-based and deep-learning models, the system achieved high accuracy and scalability, meeting real-world deployment standards.

A hybrid context-based model architecture was proposed, surpassing Named Entity Recognition (NER)-only models in accuracy by 10% on benchmark tests. Compared to leading cloud services and language models, the system showed superior performance and coverage of sensitive data across multiple languages without fine-tuning.
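
The general idea of combining rule-based and model-based detection can be sketched as follows. This is not the architecture from the study. It is a minimal Python illustration in which a regex detector and a stand-in "model" (a hard-coded name lookup named `stub_model_spans`, purely hypothetical) each return character spans, and the spans are merged before masking.

```python
import re
from typing import Callable, Iterable, List, Tuple

Span = Tuple[int, int]


def rule_spans(text: str) -> List[Span]:
    """Rule-based detector: Social Security number shapes."""
    return [m.span() for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", text)]


def stub_model_spans(text: str, names: Iterable[str] = ("John Smith",)) -> List[Span]:
    """Stand-in for a trained NER model; a real system would call the model here."""
    spans = []
    for name in names:
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name)))
            start = text.find(name, start + len(name))
    return spans


def merge_spans(spans: Iterable[Span]) -> List[Span]:
    """Union overlapping or touching spans so each region is masked once."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text: str, detectors: Iterable[Callable[[str], List[Span]]]) -> str:
    spans = merge_spans(s for detect in detectors for s in detect(text))
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(text[cursor:start])
        pieces.append("[PHI]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


print(redact("John Smith SSN 123-45-6789 stable.", [rule_spans, stub_model_spans]))
```

Taking the union of both detectors favors recall: anything either component flags is masked.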

2. UCSF’s Certified De-Identification Pipeline

The University of California, San Francisco (UCSF) released a certified, HIPAA-compliant de-identification pipeline, Philter V1.0, in 2021 to de-identify clinical note text for research. The pipeline has since made over 130 million certified de-identified clinical notes accessible to more than 600 UCSF researchers. These notes, spanning 40 years, encompass data from 2,757,016 UCSF patients.

Philter V1.0 transforms the clinical notes de-identification process by streamlining and automating it, thus enabling scalability for large volumes of unstructured text data. ArcherHall’s algorithmic enhancements and certification techniques have significantly improved Philter’s performance.

3. Cerner Real-World Data (CRWD): A De-Identified EHR Database

Cerner Real-World Data™ (CRWD) is a de-identified big data source of multicenter EHRs. By securing appropriate data use agreements and permissions from over 100 health systems, Cerner Corporation ensures compliance with privacy regulations while providing valuable healthcare data for research purposes.

Researchers from academic institutions, healthcare systems, and life sciences sectors can access CRWD if their healthcare organization contributes de-identified data to the dataset. Alternatively, researchers can collaborate with Cerner through a Learning Health Network (LHN) to access HealtheDataLab, a cloud-parallel distributed learning framework, to conduct approved research projects.

Benefits of Data De-Identification in Healthcare

Data de-identification in healthcare offers numerous benefits, including:

  • Protects Patient Confidentiality: Data de-identification ensures that sensitive personal information is removed from medical records. This preserves patient privacy and confidentiality.
  • Supports Healthcare Research: By providing access to anonymized data, data de-identification enables researchers to analyze trends and develop treatments more effectively.
  • Facilitates Public Health Alerts: De-identified data enables researchers to issue timely public health warnings without compromising patient privacy. 
  • Reduces Risk of Data Breaches: Removing sensitive information minimizes the chance of unauthorized access, enhancing data security.
  • Improves Patient Privacy: Data de-identification helps to mitigate the risk of patient information being disclosed or compromised. This improves patient privacy and builds trust between patients and healthcare providers.

Future Trends in De-Identification 

Looking ahead, the future of data anonymization holds several key trends. These include:

  • Automated Anonymization with AI and ML: AI and machine learning will increasingly automate and improve anonymization processes. These technologies will enable systems to adapt and evolve, ensuring more robust protection of Protected Health Information (PHI).
  • Customizable De-Identification Software: More software tools will be designed for de-identification, offering better options to customize the process. This flexibility will enable organizations to tailor de-identification to their specific needs.
  • Rise of Privacy-Enhancing Technologies: We’ll also likely witness the rise of privacy-focused technologies like homomorphic encryption, differential privacy, and federated learning.
  • Blockchain for Anonymization: Another frequently mentioned idea is using blockchain for anonymization. The aim is to keep data secure and to share it only in a form that recipients cannot alter or access without authorization.

These advancements are all about ensuring healthcare data stays private while still being useful for research and other essential purposes.
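
Of the privacy-enhancing technologies mentioned above, differential privacy is the easiest to show in a few lines. The Python sketch below applies the Laplace mechanism to a simple count, such as the number of patients with a given diagnosis. The function name and the example numbers are assumptions. A smaller `epsilon` adds more noise and gives stronger privacy.

```python
import random


def noisy_count(true_count: int, epsilon: float, rng=random) -> float:
    """Laplace mechanism for a counting query (sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical example: release a patient count with epsilon = 1.0.
print(noisy_count(120, epsilon=1.0))
```

Each released value differs slightly from the true count, but averages over many patients remain useful for research.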

iMerit De-Identification Solution Empowering Research in Healthcare

Transform your data management with iMerit’s purpose-built De-Identification Solution. Using state-of-the-art technology, iMerit leverages pre-trained NLP models to detect and de-identify sensitive patient information quickly and accurately. iMerit also offers the option to integrate human expert teams for added verification and review.

With customizable features and automated workflows, iMerit streamlines your data pipeline and enhances quality control. It also simplifies data sharing while complying with healthcare regulations.

Enhance your research capabilities with iMerit. Explore iMerit today!
