GDPR Technical Series #4: Finding Personal Data – Types and Mechanics
Dec 3, 2018
In the last post of our GDPR series, we discussed several closely related terms pertaining to finding data, and we distinguished between them. GDPR defines two types of personal data:
- “basic” personal data, whose processing is governed by Article 6, and
- special categories of personal data, described in Article 9.
The “basic” personal data includes items such as the following (drawn from European Commission guidance):
- a name and surname
- a home address
- an email address
- an identification card number
- location data (for example the location data function on a mobile phone)
- an Internet Protocol (IP) address
- a cookie ID
- the advertising identifier of your phone
- data held by a hospital or doctor, which could be a symbol that uniquely identifies a person
These are typically items that identify a person, directly or indirectly. Article 6 of the GDPR governs the processing of this data, allowing processing where one or more lawful bases exist, including:
- consent of the subject
- legal requirement(s)
- legitimate interest(s) of the controller.
The special categories of personal data include items such as:
- racial or ethnic origin
- political opinions
- religious or philosophical beliefs
- trade union membership
- genetic data
- biometric data for the purpose of uniquely identifying a natural person
- sex life or sexual orientation.
The special categories of personal data require additional protection and consideration, because these attributes could be used to discriminate against or otherwise target a person. For that reason they are also referred to as sensitive personal data.
Under GDPR Article 9, processing of special categories of personal data is prohibited unless one of these conditions applies:
- explicit consent is given by the subject
- the data has manifestly been made public by the subject, or
- any one of a number of specific legal or public interest situations (spelled out in Article 9) that require processing of this data exists.
In general, compared to Article 6, Article 9 imposes tighter conditions for allowing processing.
Lastly, GDPR Article 10 gives additional protection to data relating to criminal convictions and offences.
From the point of view of automating the detection of personal data, both “basic” personal data and the special categories pose technical challenges. Personal data such as names and addresses often produces false positives unless the detection algorithm is sophisticated. For example, in an unstructured document, a given name such as “April” without more context might be mistaken for the name of a month. Similarly, since many street or city names are also legitimate last names, confusion can arise without a sufficiently strong detection method. Take an address such as 12343 Washington Blvd., Fremont CA: both Fremont and Washington can be last names, street names, or city names. The ordering of the elements in street addresses also varies widely across countries. Since GDPR covers the European Union (EU), and the EU has 28 member countries (including the UK, which is committed to implementing GDPR even after leaving the EU), address formats from all of these countries need to be detected. Detecting “basic” personal data therefore requires a sophisticated approach.
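The “April” ambiguity can be illustrated with a minimal Python sketch. It is not a production detector: the cue lists, the classify_april function name, and the one-token context window are all assumptions made for illustration, and a real system would more likely rely on a trained named-entity recognizer.

```python
import re

# Hypothetical context cues chosen for illustration only.
PERSON_CUES = {"dear", "hi", "hello", "mr", "mrs", "ms", "dr"}
MONTH_CUES = {"in", "on", "of", "since", "until", "by", "last", "next"}
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def classify_april(text):
    """Label each occurrence of 'April' as 'person', 'month', or 'unknown'
    by inspecting the tokens immediately before and after it."""
    tokens = TOKEN_RE.findall(text)
    labels = []
    for i, tok in enumerate(tokens):
        if tok != "April":
            continue
        prev_tok = tokens[i - 1].lower() if i > 0 else ""
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if prev_tok.isdigit() or next_tok.isdigit() or prev_tok in MONTH_CUES:
            labels.append("month")    # e.g. "April 3", "3 April", "in April"
        elif prev_tok in PERSON_CUES:
            labels.append("person")   # e.g. "Dear April"
        else:
            labels.append("unknown")  # not enough context to decide
    return labels


print(classify_april("Dear April, your invoice is due on April 3."))
# -> ['person', 'month']
```

Even this toy version shows why context matters: with no neighboring cue (as in “April called.”), the function can only answer “unknown”, which is exactly the case where a naive detector would produce a false positive or a miss.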
The special categories of personal data present an even greater challenge to automated detection methods, because most of the items in the special categories don’t have a tight, formally expressible definition. For example, inferring someone’s political leanings in an automated scan, especially of an unstructured document, is a non-trivial exercise, though a human reading the same document can easily glean this information. This is where machine learning (ML), training of the ML module, and customization play heavy roles. An automated detection system must be highly customizable and trainable in order to be effective in finding data that falls under the special categories.
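To make the difficulty concrete, here is a deliberately simple keyword-lexicon baseline in Python. It is not machine learning, only a stand-in that a trained classifier would be expected to outperform. The lexicons, the threshold, and the flag_special_categories name are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lexicons. A real deployment would need
# curated per-language term lists or, more likely, a trained classifier.
LEXICONS = {
    "political_opinions": {"party", "election", "vote", "campaign"},
    "religious_beliefs": {"church", "mosque", "synagogue", "prayer"},
    "trade_union_membership": {"union", "strike", "picket"},
}


def flag_special_categories(text, threshold=2):
    """Return the categories whose lexicon terms occur at least
    `threshold` times in the text (case-insensitive, whole words)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for word in words:
        for category, terms in LEXICONS.items():
            if word in terms:
                counts[category] += 1
    return sorted(c for c, n in counts.items() if n >= threshold)


print(flag_special_categories(
    "She joined the campaign and plans to vote for the party."
))
# -> ['political_opinions']
```

A baseline like this misses paraphrase, irony, and any vocabulary outside its lists, and it flags innocent uses of words such as “party”. Those gaps are the reason a customizable, trainable approach is needed for the special categories.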
Lastly, all of the above algorithms need to operate on data sizes that are exploding. Scalability of any detection algorithm is critical for it to be successful in a real-world GDPR production context.
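One common way to keep a scanner usable on large inputs is to process data as a stream rather than loading it all into memory. The Python sketch below assumes a line-oriented text source and uses a simplified email pattern. The scan_stream name and the pattern are illustrative assumptions, not a complete or standards-compliant detector.

```python
import io
import re

# Simplified email pattern for illustration; it does not cover every
# valid address format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_stream(lines):
    """Lazily yield (line_number, match) pairs, one line at a time,
    so memory use does not grow with the size of the input."""
    for line_no, line in enumerate(lines, start=1):
        for match in EMAIL_RE.finditer(line):
            yield line_no, match.group()


# Any iterable of lines works, including an open file handle.
sample = io.StringIO("no data here\ncontact: jane.doe@example.org\n")
for line_no, hit in scan_stream(sample):
    print(line_no, hit)
# -> 2 jane.doe@example.org
```

Because scan_stream is a generator, the same function can be pointed at an open file of any size, and independent files can be handed to separate workers when a single process is not enough.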
So far in the context of GDPR, we have covered detection of personal data and protection methods for personal data. In the next post of this series, we will be discussing other aspects of GDPR, namely, breach detection and reporting.