GDPR Technical Series #1: Anonymization and Pseudonymization

A fundamental tenet of the European Union’s General Data Protection Regulation (GDPR), which went into effect on May 25, 2018, is the recommendation to pseudonymize personal data wherever possible. Articles 4, 6, 25, 32, 40 and 89 as well as Recitals 28, 29, 75, 78, 85 and 156 of the GDPR explicitly call out pseudonymization. Interestingly enough, the GDPR does not use the term anonymization anywhere in the text, although Recital 26 notes that the regulation does not apply to anonymous information. While the GDPR deems pseudonymization sufficient, it does not preclude other means of data protection (Recital 28).

In this blog post, we examine the differences between these two terms. In the next blog post of our GDPR Series, we cover common element-level protection techniques and map anonymization and pseudonymization to those techniques.

Anonymization vs Pseudonymization

First, let’s take the ideal case: anonymization. Anonymization is the transformation of data so that it can no longer be associated with a particular person. For anonymization to be effective, it must not be possible to identify the person associated with the data, even when other knowledge is combined with the anonymized data. The problem for data controllers and data processors is that, in most cases, perfect anonymization also renders the data useless for analytics. This loss of analytic value could be one explanation for the GDPR’s omission of anonymization. Even then, however, anonymized data can still be useful for development and testing use cases.

Considering the original table below (Fig 1),

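| Last Name | First Name | Employee ID | Email Address | Title | Start Date | Department | Salary | Vacation Available (days) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jones | Edward | 3565486 | [email protected] | Technical Manager | 09/01/2013 | IT | $130,000 | 20 |
| Xu | Jason | 56544884 | [email protected] | Architect | 07/01/2010 | Engineering | $125,000 | 15 |
| Stanton | Joseph | 2484686 | [email protected] | CEO | 01/03/2008 | HQ | $400,000 | 12 |
| Powers | Rebecca | 4856459 | [email protected] | Director | 02/02/2011 | Sales | $140,000 | 18 |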

Fig 1 — Table before Anonymization

after Anonymization, the table would look like the one below (Fig 2).

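| Last Name | First Name | Employee ID | Email Address | Title | Start Date | Department | Salary | Vacation Available (days) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jenkins | David | 34543593 | [email protected] | General Manager | 07/02/2010 | IT | $170,000 | 15 |
| Cortes | Ramona | 63458245 | [email protected] | Director | 05/02/2008 | Engineering | $145,000 | 15 |
| Helleboid | Jean | 56455344 | [email protected] | Manager | 04/02/2011 | HQ | $155,000 | 10 |
| Watson | Brian | 34534887 | [email protected] | Manager | 06/07/2004 | Sales | $123,000 | 19 |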

Fig 2 — Table after Anonymization

Note that every field in Fig 2 is transformed except “Department.” Assuming each department consists of more than one person, getting back to the original data will not be possible, even with additional external information. If IT were a single-person department, however, its row could be linked back to that person. In our example the link reveals nothing, because every value other than “Department” is anonymized.
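The single-person-department caveat can be checked mechanically by counting how many records share each value of the untransformed column. The sketch below is only an illustration: the helper name `singleton_groups` and the sample rows are hypothetical and do not come from the figures above.

```python
from collections import Counter


def singleton_groups(rows, column):
    """Return the values of `column` that appear in exactly one row, sorted."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n == 1)


# Hypothetical "Department" column of a larger table (invented for illustration).
rows = [
    {"Department": "IT"},
    {"Department": "Engineering"},
    {"Department": "Engineering"},
    {"Department": "Sales"},
    {"Department": "Sales"},
    {"Department": "Sales"},
]

# Any value listed here points at exactly one person's record.
print(singleton_groups(rows, "Department"))  # ['IT']
```

A value that shows up in the result identifies one row, so any other untransformed field in that row would be exposed along with it.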

Now consider pseudonymization. Let us assume that in addition to the “Department” column the “Salary” column is also not transformed. The following table (Fig 3) results from that action rather than the table in Fig 2.

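| Last Name | First Name | Employee ID | Email Address | Title | Start Date | Department | Salary | Vacation Available (days) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jenkins | David | 34543593 | [email protected] | General Manager | 07/02/2010 | IT | $130,000 | 15 |
| Cortes | Ramona | 63458245 | [email protected] | Director | 05/02/2008 | Engineering | $125,000 | 15 |
| Helleboid | Jean | 56455344 | [email protected] | Manager | 04/02/2011 | HQ | $400,000 | 10 |
| Watson | Brian | 34534887 | [email protected] | Manager | 06/07/2004 | Sales | $140,000 | 19 |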

Fig 3 — Table after Pseudonymization

In this case, if CEO Joseph Stanton (row 3) had not been in the data set, the table would have been equivalent to an anonymized set. However, since the “Salary” column has not been transformed, the outlier in that column ($400,000) gives away the CEO’s identity and information. In other words, the knowledge that the CEO is likely to be paid well above everyone else is enough to re-identify the record after pseudonymization.
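The re-identification described here takes very little effort. As a rough sketch, assume an observer sees only the Fig 3 rows and knows from outside the data that the CEO is probably the highest-paid employee. The helper names `parse_salary` and `highest_paid` are made up for this example.

```python
def parse_salary(text):
    """Convert a string such as '$400,000' to an integer."""
    return int(text.replace("$", "").replace(",", ""))


def highest_paid(rows):
    """Return the row with the largest salary."""
    return max(rows, key=lambda row: parse_salary(row["Salary"]))


# Selected columns from Fig 3: names are pseudonyms, salaries are untransformed.
fig3 = [
    {"Last Name": "Jenkins", "Department": "IT", "Salary": "$130,000"},
    {"Last Name": "Cortes", "Department": "Engineering", "Salary": "$125,000"},
    {"Last Name": "Helleboid", "Department": "HQ", "Salary": "$400,000"},
    {"Last Name": "Watson", "Department": "Sales", "Salary": "$140,000"},
]

# External knowledge: the CEO is likely the top earner.
suspect = highest_paid(fig3)
print(suspect["Last Name"], suspect["Salary"])  # Helleboid $400,000
```

The pseudonym “Helleboid” hides the name, but the row is still tied to the CEO, and the salary in that row is the real one.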

In the examples above, so many of the fields are transformed that the data, while usable for testing, is rendered useless for meaningful analysis. To be able to draw meaningful conclusions, the fields of interest in an analysis need to be available without transformation, or at least stay in the same range, so that aggregate results are the same.
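To make the point about aggregates concrete, the short sketch below compares the average salary across the three figures. The salary values are taken from the tables, with the first Fig 1 salary read as $130,000 to match Fig 3. The function name `average` is just an illustrative helper.

```python
def average(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Salary columns from the three figures.
fig1_original = [130_000, 125_000, 400_000, 140_000]
fig2_anonymized = [170_000, 145_000, 155_000, 123_000]
fig3_pseudonymized = [130_000, 125_000, 400_000, 140_000]

print(average(fig1_original))       # 198750.0
print(average(fig2_anonymized))     # 148250.0
print(average(fig3_pseudonymized))  # 198750.0
```

Leaving “Salary” untransformed (Fig 3) preserves the average exactly, while the fully transformed column (Fig 2) gives a different answer. This is the trade-off between analytic value and re-identification risk.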

The degree of anonymization, and indeed whether a data set is anonymized or pseudonymized, depends on the nature of the untransformed data and how much it might reveal. Once there is sufficient additional information to give clues that identify the original value, the transformation is pseudonymization rather than anonymization. In the example above, the sufficient additional information is the knowledge that the CEO is likely the highest-paid employee in the company. Additional information might be public information or data available in other tables or data stores in the organization.

There are measures of the degree of anonymization, such as the family of Data Similarity measures, including k-anonymity, l-diversity, t-closeness and other criteria. There are also newer techniques such as Differential Privacy that de-identify the data. We will go into these measures and techniques in another blog post devoted to that subject.

As we saw in the discussion above, anonymization and pseudonymization are distinct approaches that protect the data set as a whole. Both effects are achieved by applying transformations at the element level. We will delve into these element-level techniques in the next blog post and map them to anonymization and pseudonymization.

Dataguise’s DgSecure offers both anonymization and pseudonymization. To learn more about DgSecure’s data protection options, please contact us for details.