Sensitive Big Data – Known knowns and Unknown unknowns…

During a press conference thirteen years ago, Defense Secretary Donald Rumsfeld famously answered a question about how military decision-making was impacted by the lack of evidence for linking Iraq with nuclear weapons:

“There are known knowns; there are things we know we know … there are known unknowns; … we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.”

In conversations with Hadoop customers across banking, insurance, technology, healthcare, and other industries who are trying to secure data in Hadoop, I often get the sense that they face a similar opacity: they do not know how much sensitive data, and how much risk, sits inside their clusters. In fact, one of the most important reasons for adopting Hadoop, the platform’s flexibility in processing data without fixed schemas, creates options and inputs that can greatly increase this opacity and risk.

Putting Rumsfeld’s notions of unknowns into a matrix, there are some interesting parallels between known information in a military context (where the bad guys are), and known data in the Big Data context (where the sensitive data lies). And the outer layer (the unknown unknowns) looks even more applicable in the context of data sitting inside meta-data formats that do not create clear boundaries or labels for data.

See the matrix immediately below. I ended up with four boxes, including a “fourth” category, call it the “known knowns”. Rumsfeld didn’t really have a reason or need to dwell on things known inside a known context, but we certainly do for Big Data: it’s the easiest box to manage and secure, the known data inside the known meta-data.


[Figure: Dataguise matrix of known and unknown data]


Where things get really tricky for Hadoop enterprises is on the left-hand side, and especially the top-left box: classic Rumsfeldian Unknown Unknowns! Enterprises are bringing, merging, and multiplexing sensitive data in Hadoop, in formats that do not structure the location of risk, with content that could include PCI, PII, or HIPAA data that could leak, lurk, or add risk of exposure if not properly protected and/or audited.
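To make that top-left box a little more concrete, here is a minimal Python sketch of how a card number can sit unlabeled inside free-form text, where no schema says "this is PCI data". Everything in it is an illustrative assumption rather than anything from a specific product: the sample records are made up, the `CARD_CANDIDATE` pattern and the `find_card_like_values` helper are hypothetical names, and the only check applied is the standard Luhn checksum used by payment card numbers.

```python
import re
from typing import List

# Hypothetical pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_like_values(text: str) -> List[str]:
    """Return digit runs in free-form text that look like card numbers."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# Made-up, schema-less records: nothing labels either line as sensitive.
records = [
    "2015-03-02 support note: customer read card 4111 1111 1111 1111 over the phone",
    "2015-03-02 order 1234567890123456 shipped to warehouse 7",
]

for line in records:
    print(find_card_like_values(line))
```

The first record contains a well-known test card number and is flagged; the second contains a 16-digit order number that fails the checksum and is not. A check this simple will still miss plenty and flag some false positives, which is part of why this box is the hard one.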

I’ll post next on some strategies for tackling the Unknown knowns at scale.