Tackling Healthcare Data De-identification: Structured vs Unstructured

May 21, 2024

Healthcare organizations generate vast amounts of data at high velocity. This data comes from many sources, including patient records, laboratory results, imaging scans, wearable devices, and pharmacy records. Some of it is neatly organized into tables, making it easier to work with. However, the larger share is unstructured and comprises physicians’ notes, medical images, and surgical recordings.

These diverse forms of medical data offer tremendous potential for research and innovation. Medical data comes in handy when researchers work on discovering new drugs and treatments. Additionally, insurance companies can use medical data to assess risks and make coverage decisions. 

However, data must be de-identified before it can be shared and used safely. De-identification removes identifiable information such as names, contact details, and specific medical histories to protect patient privacy. This ensures that the data is used responsibly and complies with privacy regulations like the Health Insurance Portability and Accountability Act (HIPAA). De-identification lets researchers and decision-makers use data to its fullest without compromising patients’ confidentiality.

While it is necessary to make data shareable for secondary uses, de-identification isn’t straightforward, especially in the case of unstructured data. Let’s explore the challenges of de-identification and how best to overcome them. 

Challenges in De-Identifying Structured Healthcare Data

Structured data is data organized in a predefined format. It adheres to a fixed schema, like a table in a database. Datasets containing patient demographics, diagnosis codes, and treatment histories are all structured data.


Since structured data has a clear format, it is comparatively straightforward to de-identify. The hard part is deciding which data points to remove while staying compliant and keeping the data useful for secondary purposes. In the United States, HIPAA governs patient privacy and outlines the rules for de-identifying sensitive information.

These regulations necessitate identifying and removing two types of information:

  1. Personally Identifiable Information (PII): Data that could directly identify an individual, such as names, addresses, and social security numbers.
  2. Protected Health Information (PHI): Health-related information linked to an individual, like medical records, treatment details, and insurance information.

Techniques to De-Identify Structured Healthcare Data

Several techniques help address these challenges while preserving the utility of structured healthcare datasets. Here are four key techniques to de-identify structured healthcare data:

Remove Direct Identifiers

Direct identifiers are pieces of information that can identify an individual on their own, such as names, addresses, and social security numbers. This technique suppresses them by removing the corresponding columns from the dataset.
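As a minimal sketch of column removal, the snippet below drops directly identifying fields from a list of records. The column names and sample records are hypothetical; a real dataset needs its own inventory of identifying columns.

```python
# Hypothetical column names; a real dataset needs its own identifier inventory.
DIRECT_IDENTIFIERS = {"name", "address", "ssn"}


def drop_direct_identifiers(rows, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows without the directly identifying columns."""
    return [
        {column: value for column, value in row.items() if column not in identifiers}
        for row in rows
    ]


# Made-up records for illustration only.
patients = [
    {"name": "Jane Doe", "ssn": "000-00-0000", "address": "1 Main St",
     "diagnosis_code": "E11.9", "age": 47},
    {"name": "John Roe", "ssn": "000-00-0001", "address": "2 Oak Ave",
     "diagnosis_code": "I10", "age": 63},
]

deidentified = drop_direct_identifiers(patients)
```

The function returns new records rather than editing the originals, so the source data stays intact for authorized use.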

Mask Direct Identifiers

In some studies, it may be necessary to share outcomes with the patients involved, which requires a way back to the original identity. In such cases, values of direct identifiers can be transformed using techniques like pseudonymization (replacing identifiers with codes) or encryption (transforming values so that they can only be read with a key). The original identifiers are then inaccessible to unauthorized users but remain safely stored in a separate, linked database table.
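One way to picture pseudonymization with a linked table is the sketch below. It assumes random codes generated with Python's `secrets` module and an in-memory dictionary standing in for the linked table; the function name, the `P-` prefix, and the sample records are all hypothetical.

```python
import secrets


def pseudonymize(rows, id_column, key_table=None):
    """Replace values in id_column with random codes.

    key_table maps each original identifier to its code. It plays the role of
    the linked table and should be stored separately under access control.
    """
    key_table = {} if key_table is None else key_table
    output = []
    for row in rows:
        original = row[id_column]
        if original not in key_table:
            # Random code, so it cannot be reversed without the key table.
            key_table[original] = "P-" + secrets.token_hex(8)
        output.append({**row, id_column: key_table[original]})
    return output, key_table


# Made-up records for illustration only.
visits = [
    {"patient_id": "MRN-1001", "lab": "HbA1c", "value": 6.9},
    {"patient_id": "MRN-1002", "lab": "HbA1c", "value": 5.4},
    {"patient_id": "MRN-1001", "lab": "LDL", "value": 131},
]

shared_visits, key_table = pseudonymize(visits, "patient_id")
```

Because the same patient always receives the same code, records can still be linked for analysis, while re-identification requires access to the key table.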


Generalization

Generalization involves removing precision from data values to create broader categories. For example, specific dates may be generalized to months or years, and exact ages may be grouped into intervals. This process reduces the granularity of the data while still preserving its utility for analysis. Care must be taken to apply the same rules across the whole dataset and to avoid overlap between categories.
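The two examples above can be sketched in a few lines. The 10-year band width, the choice to keep only the year, and the field names are illustrative assumptions rather than prescribed values.

```python
from datetime import date


def generalize_age(age, width=10):
    """Map an exact age to a non-overlapping band such as '40-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalize_date(value):
    """Keep only the year of a date."""
    return value.year


# Made-up record for illustration only.
record = {"age": 47, "admission_date": date(2023, 3, 14), "diagnosis_code": "E11.9"}

generalized = {
    **record,
    "age": generalize_age(record["age"]),
    "admission_date": generalize_date(record["admission_date"]),
}
```

Deriving each band from integer division means every age falls into exactly one band, which avoids the overlap problem mentioned above.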


Suppression

Suppression removes data elements called Quasi-Identifiers (QIs). These are elements that do not identify a person on their own but, combined with other available information, can pose a risk of re-identification. QIs include ZIP codes, dates of birth, rare medical conditions, rare procedures, etc. Suppression can occur at different levels, from individual cell values to entire rows or sets of quasi-identifiers. Because every suppressed value reduces data utility, what to suppress must be chosen carefully so that de-identification is effective without compromising data integrity.
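Row-level suppression can be sketched as follows: rows whose combination of quasi-identifiers is too rare are dropped. The threshold `k`, the column names, and the sample rows are hypothetical, and the rule used here (every combination must appear at least `k` times, the idea behind k-anonymity) is just one possible criterion.

```python
from collections import Counter


def suppress_rare_rows(rows, quasi_identifiers, k=2):
    """Drop rows whose quasi-identifier combination occurs fewer than k times."""
    def key(row):
        return tuple(row[column] for column in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] >= k]


# Made-up rows for illustration only.
rows = [
    {"zip3": "941", "age_band": "40-49", "procedure": "appendectomy"},
    {"zip3": "941", "age_band": "40-49", "procedure": "appendectomy"},
    {"zip3": "100", "age_band": "80-89", "procedure": "heart transplant"},
]

kept = suppress_rare_rows(rows, ["zip3", "age_band", "procedure"], k=2)
```

Here the single heart transplant row is unique in the dataset and is removed, while the two matching rows are kept.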

Challenges in De-identifying Unstructured Healthcare Data

Unstructured healthcare data, such as medical images, scans, and free-form text, is often estimated to make up around 80% of the data generated in the healthcare industry. Medical research relies on this data immensely. Analyzing it gives insights into disease lifecycles and treatment efficacy, ultimately improving healthcare delivery.

Despite its abundance and potential, unstructured data’s unpredictable format typically makes it more challenging to de-identify than structured data. 


The lack of a predefined structure makes identifying and removing PII more resource-intensive. The volume and variety of unstructured data necessitate even more robust de-identification strategies to enable its utilization in research and analysis while safeguarding patient privacy. 

Techniques to De-Identify Unstructured Healthcare Data

Overcoming these challenges demands robust techniques for de-identifying unstructured healthcare data. The three most common techniques are:

Image Redaction

This technique involves modifying or removing sensitive information from medical images. Image redaction targets both patient identifiers that appear in an image and anatomical features that could reveal who the patient is. Identifiers are often replaced with special characters, such as (*), while anatomical features are blurred or pixelated. Redaction should be limited to the sensitive regions so that the diagnostic utility of the image is retained.
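To make the idea concrete, here is a minimal sketch that overwrites one rectangular region of an image. It treats the image as a plain nested list of grayscale values instead of using an imaging library, and it assumes the location of the sensitive region is already known; both are simplifications.

```python
def redact_region(pixels, top, left, height, width, fill=0):
    """Return a copy of a grayscale image with one rectangle overwritten by fill."""
    redacted = [row[:] for row in pixels]
    for y in range(top, min(top + height, len(redacted))):
        for x in range(left, min(left + width, len(redacted[y]))):
            redacted[y][x] = fill
    return redacted


# A tiny 4x6 stand-in for a scan. Assume the first four pixels of the top row
# hold patient text that is part of the image itself.
scan = [
    [255, 255, 255, 255, 90, 90],
    [80, 82, 85, 88, 90, 91],
    [79, 81, 84, 87, 89, 92],
    [78, 80, 83, 86, 88, 90],
]

clean = redact_region(scan, top=0, left=0, height=1, width=4)
```

Only the chosen rectangle changes, so the rest of the image remains available for diagnosis.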

Data Perturbation

Data perturbation involves introducing controlled noise or alterations to unstructured data. This technique makes it difficult to identify individuals but preserves the statistical properties of data for analysis. 
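One common form of perturbation for free-text notes is date shifting, sketched below. It assumes dates are written as YYYY-MM-DD and that each patient gets one random offset; the 30-day range, the seed, and the sample note are hypothetical.

```python
import random
import re
from datetime import date, timedelta

# Assumes ISO-formatted dates; real notes use many more date formats.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def shift_dates(note, offset_days):
    """Shift every ISO-formatted date in a note by the same number of days."""
    def replace(match):
        year, month, day = (int(part) for part in match.groups())
        shifted = date(year, month, day) + timedelta(days=offset_days)
        return shifted.isoformat()

    return DATE_PATTERN.sub(replace, note)


# One offset per patient keeps the intervals between that patient's events intact.
rng = random.Random(42)  # seeded only so the sketch is repeatable
offset = rng.choice([days for days in range(-30, 31) if days != 0])

# Made-up note for illustration only.
note = "Admitted 2023-03-14 with chest pain. Discharged 2023-03-17."
shifted_note = shift_dates(note, offset)
```

The real dates are hidden, yet the length of stay is still three days, which is the kind of statistical property the technique aims to preserve.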

Machine Learning

Machine learning (ML) algorithms are powerful tools for de-identifying healthcare data. They can be trained to identify and remove personal information from unstructured healthcare data automatically. Automating de-identification speeds up the process significantly because humans no longer need to handle each image or file individually. It also reduces the risk of privacy breaches, since fewer people see the raw data.
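The snippet below is not machine learning. It is a rule-based stand-in, using regular expressions, that shows the detect-and-replace step a trained model would perform on free text. The patterns, tags, and sample note are illustrative assumptions; in practice a trained model would take the place of the patterns.

```python
import re

# Illustrative patterns only; real notes need far broader coverage
# (names, addresses, dates), which is where trained models come in.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scrub(text):
    """Replace each matched identifier with a bracketed category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up note for illustration only.
note = "Pt reachable at 555-010-1234 or jdoe@example.com. SSN 000-12-3456 on file."
scrubbed = scrub(note)
```

Replacing identifiers with category tags rather than deleting them keeps the sentence readable for downstream analysis.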

Overcome Challenges in De-identification by Automation

Technology and automation make de-identification of healthcare data faster and more effective. ML models trained on large volumes of data can quickly spot and anonymize personal identifiers.

However, human oversight is crucial to maintain precision. With a human in the loop (HITL), errors can be flagged, complex cases can be addressed, and quality can be ensured. This combination of technology and human expertise improves both the speed and the reliability of de-identification.
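A human-in-the-loop workflow can be sketched as a simple routing step: detections the model is confident about are redacted automatically, and the rest are queued for a reviewer. The `Detection` structure, the confidence scores, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    start: int         # character offset where the suspected identifier begins
    end: int           # character offset just past its last character
    label: str         # category such as "NAME" or "DATE"
    confidence: float  # model score between 0 and 1 (hypothetical)


def route(detections, threshold=0.9):
    """Split detections into auto-redacted ones and ones queued for human review."""
    auto, review = [], []
    for detection in detections:
        (auto if detection.confidence >= threshold else review).append(detection)
    return auto, review


def redact(text, detections):
    """Replace each detected span with its label, right to left so offsets stay valid."""
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"[{d.label}]" + text[d.end:]
    return text


# Made-up note and model output for illustration only.
note = "Seen by Dr. Adams on 2023-03-14 for follow-up."
detections = [
    Detection(12, 17, "NAME", 0.97),
    Detection(21, 31, "DATE", 0.62),
]

auto, review = route(detections)
draft = redact(note, auto)
```

In this example the name is redacted immediately, while the low-confidence date waits in `review` until a person confirms or rejects it.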

Final Thoughts

Unlocking the research potential of data while staying compliant is only possible through effective de-identification. From structured datasets to unstructured medical images and free-form text, the challenges vary, but with the right techniques and automation, they can be overcome. Using artificial intelligence (AI) and ML for automated de-identification, supplemented by human oversight for quality assurance, helps ensure reliable results.

iMerit offers the resources needed to de-identify data accurately and drive medical innovation. Choose iMerit and start de-identifying data today!
