The General Data Protection Regulation (GDPR) is often considered the "gold standard" in data protection regulation. Simultaneously, it is often also perceived to hinder certain uses of data that are in the interest of individuals, firms or even society at large. However, data protection and data use do not need to oppose each other, and anonymisation and pseudonymisation can help to bring them together.
The GDPR principles allow data processing based on one of six lawful grounds and require data minimisation, purpose specification and storage limitation. These principles limit what organisations can legally do with the personal data they collect. Anonymisation turns personal data into non-personal data and therefore takes it out of scope of the GDPR. Pseudonymised data, in contrast, remains personal data, but can lead to more relaxed obligations for the data controller. In the following, we explore, first, anonymisation and, second, pseudonymisation, as well as how secure multi-party computation (SMPC) can enable them.
Anonymisation constitutes a process of irreversible de-identification. Non-personal data does not fall within the scope of the GDPR and may be processed without reliance on a lawful ground, such as consent. It is not subject to the strict storage and purpose limitation requirements of the GDPR and may be kept and analysed indefinitely. It may be used for a myriad of purposes, such as machine learning, statistical analysis, or even direct monetisation through selling data and/or data-generated insights.
However, the definition of anonymised data is highly disputed, and authoritative bodies have not yet provided much clarity on the threshold for, and the privileges associated with, anonymisation. Authorities have issued differing opinions: while the German Federal Commissioner for Data Protection and Freedom of Information (BfDI) has recognised that identifiability will need to be a relative rather than an absolute term, the Article 29 Working Party (WP29), which has since been replaced by the European Data Protection Board (EDPB), has set the bar very high, leaning towards an absolute concept of identifiability.
The GDPR itself does not provide much clarity on the definition of anonymised data. Recital 26 describes it as “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. Neither the data controller nor any other party should be able to identify the data subject directly or indirectly.
With an absolute concept of identifiability, all means and possibilities available to the data controller or any third party to connect information to a data subject render data personal. Even a far-fetched theoretical possibility of identification keeps the data within the scope of the GDPR. Accordingly, data can only be considered fully anonymised if the original raw data is irreversibly deleted. This raises the question of whether this form of anonymisation is really achievable while preserving some utility of the data.
The relative concept of identifiability follows a risk assessment and looks at the realistic chances of re-identification of the data subject. Only the means and possibilities that could realistically be applied by the data controller or a third party need to be factored in. Hence, if a recipient is unable to get access to the additional information to re-identify the data subject or to the original raw data, the processed, transferred version of the data could be considered non-personal data, even if the original raw data still exists. For example, if someone only gets access to an encrypted data set without access to the decryption key, the encrypted data set can be considered anonymised. Nonetheless, it is important to note that the original data still remains personal data in scope of the GDPR.
SMPC is a way of randomly breaking up confidential data into encrypted shares such that certain computations can still be performed on those shares. A single share, or any incomplete set of shares, provides no information about the original data. The shares are divided amongst the participating parties, and the computation can only be performed when all parties collaborate. Only the output of the computation is exposed, not the original underlying data.
A classic example is the calculation of the average salary of three individuals, Allie, Brian and Caroline. They each split their salaries into three randomly generated shares. Considering Allie’s $100k salary, this generates $20k, $30k and $50k. Allie keeps the $50k share for herself, gives the $30k share to Brian and the $20k share to Caroline. Brian and Caroline execute a similar splitting and distribution step, leading to each person holding three secret shares, one from each participant.
Each person then adds up the three shares they hold and reveals only that subtotal. Adding the three subtotals gives the combined salary ($600k), from which the average ($200k) can be found without Allie, Brian or Caroline exposing their individual salaries, the original raw information.
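The salary example above can be sketched in a few lines of Python. This is a minimal illustration of additive secret sharing, not a production SMPC protocol. Only Allie's $100k salary and the $600k total come from the text; Brian's and Caroline's salaries, the `MODULUS` value and all function names are assumptions made for the sketch. Unlike the round figures in the prose, the shares here are drawn uniformly at random modulo a large number, so that an individual share looks like noise.

```python
import secrets

# Assumed public modulus, chosen far larger than any realistic salary total.
MODULUS = 2**61 - 1


def split_into_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n random shares that sum to `value` modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    # The last share is fixed so that all shares add up to the original value.
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_average(salaries, modulus=MODULUS):
    """Compute the average from subtotals of shares, never from raw salaries."""
    n = len(salaries)
    # Row i contains the shares that participant i hands out (one per party).
    dealt = [split_into_shares(salary, n, modulus) for salary in salaries]
    # Participant j adds up the shares received (column j) and reveals the subtotal.
    subtotals = [sum(dealt[i][j] for i in range(n)) % modulus for j in range(n)]
    total = sum(subtotals) % modulus
    return total / n


# Hypothetical salaries: only Allie's figure is given in the text.
salaries = {"Allie": 100_000, "Brian": 200_000, "Caroline": 300_000}
print(secure_average(list(salaries.values())))  # 200000.0
```

In a real deployment each participant would run the splitting step locally and send shares over secure channels; here everything runs in one process purely to show the arithmetic.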
According to the strict WP29 opinion that reflects an absolute concept of identifiability, this method can only be considered anonymisation if each party irreversibly destroys the original data: Allie, Caroline and Brian should forget what their original salaries were. So how can data have any use at all after “full” anonymisation?
The WP29 opinion is not legally binding, and there are diverging opinions. For example, the recent EU-funded Scalable Oblivious Data Analytics (SODA) project takes the view that “the fact that the data fragmentation procedure [secret-sharing step] as such is processing of personal data does not mean that the output data has to fall under the scope of the GDPR. On the contrary, based on the arguments put forward here the data shards that have undergone the partitioning are considered to be non-personal data.”
Treating the output data as anonymised, even if the original input data has not been deleted, opens the door to a whole range of viable use cases for data anonymised with SMPC. However, the legal uncertainty surrounding the absolute and relative concepts of identifiability means that many of those uses are currently not explored.
Pseudonymisation is a way of reducing privacy risks and accordingly relaxes some of the GDPR's requirements. The GDPR mentions it a whopping 15 times, giving a clear indication of its importance. It can be seen as the regulator’s incentive to keep data processing inside the GDPR, rather than trying to take it out of scope with anonymisation. It is defined as:
“The processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”
The concept goes beyond the protection of ‘the real world person identity’ to also cover the protection of indirect identifiers relating to a data subject (e.g. online unique identifiers). The reversal of pseudonymisation should not be trivial for any third parties that do not have access to the ‘additional information’ needed to map identifiers to individuals.
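One common way to meet this definition is a keyed hash, where the key plays the role of the ‘additional information’. The sketch below is only an illustration under assumptions: the record layout, the example e-mail addresses and the `pseudonymise` helper are hypothetical, and the text does not prescribe this particular technique. It uses Python's standard `hmac` and `hashlib` modules.

```python
import hashlib
import hmac
import secrets


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# The key is the 'additional information'. In practice it would be stored in a
# separate, access-controlled system, not alongside the pseudonymised records.
secret_key = secrets.token_bytes(32)

# Hypothetical records for illustration only.
records = [
    {"email": "allie@example.com", "salary": 100_000},
    {"email": "brian@example.com", "salary": 200_000},
]

# The direct identifier is replaced by its pseudonym before the data is shared.
pseudonymised = [
    {"pseudonym": pseudonymise(r["email"], secret_key), "salary": r["salary"]}
    for r in records
]
```

Without the key, a third party cannot easily recompute or reverse the pseudonyms, while the key holder can still link new records for the same person to the same pseudonym.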
In contrast to anonymisation, the legal position on using SMPC for pseudonymisation is much more certain. The main requirement is that the additional information needed to re-link a data subject with pseudonymised data has to be kept separately and be subject to technical and organisational measures.
Instead of someone’s annual salary as the original information, imagine a pseudonym (D) being the original input. This pseudonym is split into distinct shares (D1, D2, D3, ..., Dn), which are distributed among different recipients. The original pseudonym can only be retrieved if the recipients collaborate. Only in this case can the data subject be identified; one recipient alone does not have enough information to achieve this. Hence, SMPC can act as a technical and organisational measure to protect the additional information.
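The splitting of a pseudonym D into shares D1, ..., Dn can be illustrated with XOR-based secret sharing, one of the simplest building blocks used in SMPC. This is a sketch under assumptions: the pseudonym value `customer-4711` and the function names are hypothetical, and the scheme shown requires all n shares for reconstruction.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_pseudonym(pseudonym: bytes, n_shares: int) -> list:
    """Split a pseudonym into n shares; all n are needed to rebuild it."""
    if n_shares < 2:
        raise ValueError("at least two shares are required")
    # n-1 shares are pure randomness of the same length as the pseudonym.
    random_shares = [secrets.token_bytes(len(pseudonym)) for _ in range(n_shares - 1)]
    # The final share is the pseudonym XORed with every random share.
    final_share = reduce(xor_bytes, random_shares, pseudonym)
    return random_shares + [final_share]


def reconstruct_pseudonym(shares: list) -> bytes:
    """Recombine all shares by XORing them together."""
    return reduce(xor_bytes, shares)


pseudonym_d = b"customer-4711"          # hypothetical pseudonym D
shares = split_pseudonym(pseudonym_d, 3)  # D1, D2, D3 go to different recipients
assert reconstruct_pseudonym(shares) == pseudonym_d
```

Each individual share is statistically indistinguishable from random bytes, so a single recipient learns nothing about D; reconstruction only succeeds when every recipient contributes their share.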
The GDPR specifies the cases in which pseudonymised data is subject to lower requirements. As a security measure (Art. 32 GDPR), pseudonymisation can render data breaches “unlikely to result in a risk to the rights and freedoms of natural persons”, which reduces liability and notification obligations for data breaches (Art. 33 GDPR). It is also considered a technical and organisational measure that helps to enforce the data minimisation principle and to comply with data protection by design and by default (DPbDD) obligations (Art. 25 GDPR).
Other benefits include:
“Legitimate interest” argumentation: Legitimate interest is one of the lawful bases for data processing (Art. 6 GDPR), which can serve as an alternative to obtaining “consent”. With “consent”, businesses risk facing some uncertainty over whether their opt-ins meet the GDPR's requirements of freely given, specific, informed and unambiguous consent. Furthermore, low consent rates can lead to incomplete and biased data sets, which can be detrimental for purposes like data analytics. It is difficult to draw accurate conclusions about customer churn or the effectiveness of a marketing campaign from an incomplete data set that covers consented users only. The application of proper pseudonymisation techniques functions as a supporting factor for a “legitimate interest” argumentation.
Pseudonymisation can also relax the restrictions for specified purposes: The GDPR’s principles of purpose limitation and storage limitation restrict data use to the original collection purpose, meaning that, in principle, data must be deleted as soon as that purpose has been achieved. This stands in the way of using personal data for additional purposes such as statistical research, data science, machine learning or fraud detection. Pseudonymisation can be seen as an “appropriate safeguard” which can help make new purposes compatible with the initial collection purpose. Pseudonymised data may also be archived for statistical processing, public interest, scientific or historical research.
Facilitate international data flows: Since the European Court of Justice (ECJ) declared the EU-US Privacy Shield invalid in the Schrems II decision, companies transferring the personal data of EU data subjects to the US, including those using US cloud service providers, have been grappling with legal uncertainty. Businesses need to fall back on Art. 46 transfer tools such as Standard Contractual Clauses and the application of ‘additional safeguards’ to ensure an adequate level of data protection in the receiving country. Pseudonymisation can serve as such an ‘additional safeguard’ and thereby support a GDPR-compliant transfer of data to non-EEA countries such as the US.
Facilitate data flows across organisations: Even if businesses have control over their own systems, they remain responsible for the data they supply to third-party vendors. By supplying only pseudonymous data to their partners, businesses strongly decrease the risk of re-identification, achieve a higher degree of data privacy and support compliance with the GDPR principles of purpose limitation, integrity and confidentiality. The partner gets all the information it needs, but nothing beyond that. The sending business keeps control of the identifying information and has less reason to worry about its partners' data processes.
There is no one-size-fits-all pseudonymisation solution: Each data set, its processing context and its utility or purpose should be reviewed on a case-by-case basis to find the most fitting technique or combination of techniques. If multiple parties are involved in the handling of personal data, SMPC and homomorphic encryption are two techniques which could offer the right level of protection.
The benefits of pseudonymisation under the GDPR are manifold. Whilst new pseudonymisation techniques are continuously being developed, extensive research has been done on the current state-of-the-art techniques by institutes such as the European Union Agency for Cybersecurity (ENISA). It is essential for businesses to recognise that they can use these techniques to meet their GDPR obligations while opening up the door to more multifaceted data handling processes. This is the best way to make the most of data assets.