The Impact of GDPR on Healthcare Data De-Identification: What You Need to Know

April 17, 2024

A person’s healthcare information is deeply personal and must be respected and protected by those who handle it. It includes the individual’s personally identifiable information (PII), healthcare history, present conditions, and medications. Because healthcare data is so sensitive, it must be handled in line with global and local data protection regulations, such as GDPR.

One common safeguard is to de-identify information before it is stored or used. De-identification protects the patient’s anonymity by removing any personal information linked with their records. It allows organizations to store and process the data while maintaining patient privacy and trust.

This article discusses healthcare data de-identification under GDPR. It touches upon the importance of anonymity and what organizations must do to meet these requirements.

GDPR Overview for Healthcare

The General Data Protection Regulation (GDPR) took effect in May 2018 and set out strict rules for handling personal data. It applies to all organizations that deal with data related to people in the EU. It aims to protect users’ personal information and provide a regulated environment for businesses to operate in.

GDPR protects individuals’ privacy and gives them transparency into how their Personally Identifiable Information (PII) is used. It sets specific and strict rules for healthcare data because that data contains sensitive information. Some key GDPR requirements include:

  • Implementing Data Privacy Measures: GDPR requires all healthcare organizations to implement strong security protocols to ensure the secure storage and transfer of individual data. Organizations must also be able to demonstrate that these measures are in place.
  • Transparency to Patients on Data Use: All patients have the right to be briefed on how their information is being used, whether it is for medical or any other purpose.
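  • Consensual Use of Data: Processing health data requires a lawful basis under GDPR. Explicit patient consent is the most familiar one, and where consent is relied upon, it must be obtained before the data is used.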
  • Regular Training: All organization employees must undergo mandatory training sessions. The training should guide the employees toward the appropriate use of data under the GDPR guidelines.

Significance of Data De-Identification

Data de-identification refers to removing any information from a dataset that can be used to identify a person. Generally, such information includes:

  • Names
  • Addresses
  • Phone Numbers
  • Date of Birth
  • Credit/Debit Card information
  • Social Security Number

These details appear in most collected records, such as financial or medical records. In healthcare, records containing personally identifiable information (PII) are often called Protected Health Information (PHI), a term that comes from the US HIPAA rules; GDPR refers to the equivalent category as ‘data concerning health.’ All such data about people in the EU falls under GDPR.
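
As a rough illustration of what removing direct identifiers can look like, the Python sketch below drops the fields listed above from a single record. The field names and the sample record are hypothetical and not taken from any real schema. Dropping these fields is only a first step: the remaining values can sometimes still identify a person when combined, so this alone should not be assumed to make a dataset anonymous.

```python
# Hypothetical field names for the direct identifiers listed above.
# A real dataset will use its own schema.
DIRECT_IDENTIFIERS = {
    "name",
    "address",
    "phone_number",
    "date_of_birth",
    "card_number",
    "social_security_number",
}


def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of the record without the direct-identifier fields."""
    return {key: value for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS}


# Made-up sample record, for illustration only.
record = {
    "name": "Jane Doe",
    "date_of_birth": "1980-02-14",
    "phone_number": "555-0100",
    "diagnosis": "type 2 diabetes",
    "medication": "metformin",
}

cleaned = strip_direct_identifiers(record)
print(cleaned)
```

The original record is left untouched, so the identifiable copy can be deleted or locked away separately once the cleaned copy has been produced.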

Timely and appropriate measures to ensure data anonymity can save organizations from hefty losses. GDPR does not apply to data that has been truly anonymized, so a breach that exposes only such data does not carry the same GDPR obligations and penalties as a breach of personal data. This greatly reduces an organization’s legal exposure in the event of a cyber attack. Moreover, organizations can use anonymized data without needing to obtain patient consent.

GDPR’s Take on Patient Data De-identification

GDPR does not provide an exhaustive list of what it considers personal information. Rather, it is commonly summarized as: ‘personal data are any information which are related to an identified or identifiable natural person.’ The key phrase here is ‘any information,’ which must be interpreted broadly to include all possible cases. These can include medical records, medical history, and biometric data.

GDPR does define certain criteria for the storage and use of patient records. Two of its key terms are accessibility and transparency. Under these, all personal information must be accessible and easy to understand for the data subject (the patient). Moreover, patients must also be provided with relevant information on how their data is processed and used by the organization.

However, these rules apply only to data relating to an identified or identifiable person. Data that is truly anonymized is not bound by GDPR. GDPR gives some indication of what counts as de-identified information. Let’s look at the data identification spectrum in detail.

Identification Spectrum for Data

GDPR provides only loose definitions of what anonymised data looks like, but we can categorize data as follows:

  • Identified: Data that contains direct information regarding the patient. Information like name, address and social security number links directly back to the data subject.
  • Pseudonymised: Data is considered pseudonymised when direct identifiers have been replaced but a link to the patient still exists. Such information can indirectly link back to the patient, either by linking to another table or by reversing a data transformation.
  • De-Identified: Data is considered de-identified if there is no way to link it to an individual. 

To fall outside GDPR’s scope, data must strictly fall into the ‘de-identified’ category. Pseudonymised data is still treated as personal data, so only fully de-identified data can be processed free of GDPR obligations.
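
To make the middle category more concrete, here is a minimal Python sketch of one common pseudonymisation technique: replacing a patient identifier with a keyed hash (HMAC-SHA256). The function name, key, and patient IDs are illustrative assumptions, not something prescribed by GDPR. The same input always yields the same token, so records remain linkable, and anyone holding the key can recompute the token for a known patient. That remaining link is why such data is only pseudonymised, not de-identified.

```python
import hashlib
import hmac


def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), as a hex string."""
    return hmac.new(secret_key, patient_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Placeholder key for illustration; a real key would be generated securely
# and stored separately from the pseudonymised data.
key = b"example-key-kept-separately"

token_a = pseudonymise("patient-001", key)
token_b = pseudonymise("patient-001", key)
token_c = pseudonymise("patient-002", key)

print(token_a == token_b)  # same patient, same token: records stay linkable
print(token_a == token_c)  # different patients get different tokens
```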

Data Retention

GDPR also defines rules for the storage and retention of data by healthcare organizations. Personal healthcare information must only be stored for as long as it is required and must be deleted once it is no longer needed. Failure to comply can lead to fines of up to €20 million or 4% of global annual turnover, whichever is higher.

An alternative to discarding all data is data de-identification. When data is completely de-identified, it is no longer classified as ‘personal’ and can therefore be retained indefinitely. Anonymized data still holds immense value and provides medical insights for research and development.
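
A retention rule like this can be expressed as a simple check. The Python sketch below flags records whose retention period has lapsed so they can be deleted or de-identified. The ten-year period, the field names, and the sample records are all hypothetical; GDPR does not fix a single retention period, and the right value depends on the purpose of processing and on local law.

```python
from datetime import date, timedelta

# Hypothetical retention period, for illustration only.
RETENTION_PERIOD = timedelta(days=365 * 10)


def is_past_retention(last_needed: date, today: date) -> bool:
    """True if the record has been held longer than the retention period."""
    return today - last_needed > RETENTION_PERIOD


# Made-up records; "last_needed" is the date the data was last required.
records = [
    {"record_id": 1, "last_needed": date(2010, 1, 1)},
    {"record_id": 2, "last_needed": date(2023, 6, 1)},
]

today = date(2024, 4, 17)
expired = [r["record_id"] for r in records
           if is_past_retention(r["last_needed"], today)]
print(expired)  # IDs of records to delete or de-identify
```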

Healthcare Data and Artificial Intelligence

A popular use of healthcare data is for training predictive machine learning models. AI in healthcare has achieved various breakthroughs in recent years and has developed several practical applications. It utilizes all sorts of medical-related information such as health records, medical history, radiology reports, drug tests etc. All these data collections are information-rich and are used in developing predictive models and efficient treatment plans.

However, the growth of AI in healthcare has created a high demand for healthcare data, making it a target for hackers and malicious actors. This means that implementing GDPR safeguards, such as de-identification, is now more important than ever.

Removing Personally Identifiable Information from Medical Imaging

Most machine learning applications only require the underlying patterns present in medical information. Models can be trained on de-identified data and used for applications like:

  • Disease Prediction: Medical history and health records can be used to train models to classify medical conditions. These models use disease history, vitals, and existing conditions to predict the likelihood of developing additional medical conditions.
  • Processing Medical Images: Medical images such as MRI and X-ray scans are used for disease classification. Computer vision techniques are applied to detect and classify tumors and diseases like COVID-19 and pneumonia.
  • Drug Discovery: Healthcare information is formulated into knowledge graphs consisting of various drugs and conditions. These help in creating relations between diseases and drugs which otherwise seem unlikely.

Data De-Identification as a Solution

As important as it may be, data de-identification can be quite a challenge. In most organizations, data is scattered across various formats, such as tables, text, and images. Identifying and redacting sensitive information by hand can also be time-consuming and expensive.
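
To give a feel for why free text is hard, here is a minimal Python sketch of pattern-based redaction. The patterns, labels, and sample note are assumptions made for illustration and only cover a few US-style formats. Structured identifiers such as dates and phone numbers are caught, but the patient’s name passes straight through, because a name has no fixed pattern to match.

```python
import re

# Illustrative patterns only; real data needs far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up clinical note, for illustration only.
note = "Patient John Smith, born 1975-03-09, SSN 123-45-6789, phone 555-123-4567."
redacted = redact(note)
print(redacted)
# Patient John Smith, born [DATE], SSN [SSN], phone [PHONE].
```

Gaps like the unredacted name are one reason simple rules tend not to be enough for free text and images.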

A better option is to opt for automated solutions such as iMerit’s de-identification-as-a-service. iMerit’s solution features an automated workflow that uses AI to privately and discreetly identify sensitive information in various kinds of documents, such as text and images. The detected regions are then redacted and removed, resulting in complete de-identification in accordance with HIPAA guidelines.

iMerit also offers a hybrid approach that integrates human experts in the loop for quality assurance. The hybrid solution follows the same steps as the automated one but involves healthcare data specialists who correct any mis-identifications. The additional steps introduce a slight delay in the pipeline but ensure complete anonymity of all entities present in the data.


iMerit’s data de-identification with human-in-the-loop

Final Thoughts

The General Data Protection Regulation (GDPR) is among the most widely recognized and strict data protection regulations. It lays down firm ground rules for the storage, retention and processing of healthcare data for people in the EU.

One of its key expectations is that healthcare records are de-identified unless there is a justified need to keep them identifiable. De-identification refers to the removal of any information in the data that would categorize it as Protected Health Information (PHI). Such information includes names, contact numbers, addresses and social security numbers. Under GDPR, personally identifiable information cannot be stored by any organization without proper justification. Additionally, patients must be given transparency into the storage and use of their data, and processing must rest on a lawful basis such as the patient’s consent.

De-identifying all such information brings it outside the bounds of the GDPR. It enables organizations to retain the data for as long as it is useful and to use it for research and machine learning applications. It also greatly reduces their exposure under GDPR if that data is breached.

To protect your data, try the iMerit platform today.
