Blurred Lines

Words Matter in Data Protection

Forgive your enemies, but never forget their names. John F. Kennedy

The names and words we use for data protection (encryption, masking, tokenizing) all seem to be getting blurred together. And with these lines blurring, the already complex job of securing data in a Big Data context becomes even fuzzier and harder to navigate.

I’d like to go back to basics and see if we can reestablish some clear definitions and guidelines that firmly separate Encryption, Masking, and Tokenization as data protection techniques.

Blurred Lines

There is continued recognition that protecting data at the data level, also known as data-centric security, is an effective data security strategy. With this approach you specifically identify and lock or mask data elements such as credit card numbers, social security numbers, and personally identifiable information (names, phone numbers, addresses). Data-centric security is especially suited to Big Data, where performance, business agility, and the tremendously varying nature of the data present roadblocks to relying on traditional infrastructure security models such as firewalls and host IPS.

Unfortunately, lines get blurred when this recognition of data-centric security is chased by existing and new security vendors that want to sell into these Big Data opportunities and want to claim the broadest coverage of security techniques. “Sure, we can mask/encrypt/tokenize/anonymize/redact/de-identify your data…”

But words matter. These techniques achieve different business and technical goals, which is what makes blurring them harmful. Calling a penguin a swan doesn’t make it able to fly. And stating that your encryption algorithms are data masking techniques doesn’t actually mean you are masking data: encryption is a lock that can be reopened with the right key. Conversely, masking data isn’t a lock; it is more like a cement block, because you should never be able to reverse a mask.
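To make the contrast concrete, here is a minimal Python sketch of the three techniques side by side. It is an illustration under stated assumptions, not any vendor's implementation: the names `mask`, `encrypt`, `decrypt`, and `TokenVault` are hypothetical, the "encryption" is a toy one-time-pad XOR used only to show that a key reverses it (a real system would use a vetted cipher and proper key management), and the vault is a simple in-memory lookup table.

```python
import secrets


def mask(card_number: str) -> str:
    """Masking: irreversibly hide all but the last four digits.

    Nothing is stored and no key exists, so the original cannot be recovered.
    """
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) <= 4:
        # Too short to leave anything visible; hide every digit.
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def encrypt(plaintext: str, key: bytes) -> bytes:
    """Toy encryption (one-time-pad XOR): reversible by anyone holding the key.

    Assumption for illustration only: the key is random, exactly as long as
    the data, and never reused.
    """
    data = plaintext.encode("utf-8")
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(b ^ k for b, k in zip(data, key))


def decrypt(ciphertext: bytes, key: bytes) -> str:
    """Reverse the toy encryption with the same key."""
    return bytes(b ^ k for b, k in zip(ciphertext, key)).decode("utf-8")


class TokenVault:
    """Tokenization: swap a value for a random token and keep the mapping.

    The token has no mathematical relationship to the value; it is
    reversible only through a lookup in the vault.
    """

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


if __name__ == "__main__":
    card = "4111111111111111"  # a commonly used test card number

    print(mask(card))  # no way back from this value

    key = secrets.token_bytes(len(card.encode("utf-8")))
    locked = encrypt(card, key)
    print(decrypt(locked, key) == card)  # the key reopens the lock

    vault = TokenVault()
    token = vault.tokenize(card)
    print(vault.detokenize(token) == card)  # the vault lookup reverses it
```

The point of the sketch is the shape of each function: masking has no inverse at all, encryption has an inverse that needs a key, and tokenization has an inverse that needs access to the vault.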

What Do You Want?

The blurred lines don’t end there. In the last three months, I’ve heard all of these notions tossed around interchangeably: de-identifying, remediating, protecting, encrypting, locking, masking, tokenizing, anonymizing, and redacting.

Here’s a simple way to look at the technical trade-offs in these techniques.

[Image: Dataguise Technical Trade]

There are, of course, a whole series of other considerations around performance, key management, usability, transparency, strength of security, and “perfect forward secrecy”. My next post will look at the role of data masking in Hadoop, which is probably the least understood protection technique in the context of Big Data.