Securing Enterprise Data Lakes Jun 4, 2015
Real Lakes and Data Lakes:
In California, there seem to be an awful lot of lakes, even tiny ones, and no ponds. From my New Hampshire childhood, I remember small lakes as “ponds.” Which led me to go look up the actual difference between the two… According to limnology (the study of inland water bodies), the distinguishing characteristic of a lake is that it’s much deeper than a pond: a lake needs to be deep enough to create an “aphotic” region, literally, a place under water so deep that the sun don’t shine and no plants can grow. Ponds, in contrast, are shallow enough that light always reaches the bottom, so plants can grow across the whole area.
What Makes Enterprise Data Lakes Dark/Aphotic?
One of the essential characteristics of Hadoop – its flexibility in incorporating schema-less data in any number of data formats – makes for one of its largest security challenges. The water in there, if you will, is dark and murky. How do you find and identify sensitive data, be it people’s names, addresses, or highly confidential data such as credit card PANs or insurance agent fraud notes*, when that data may be unannounced and freeform in Hadoop (nested in unstructured data in Avro, or part of a user comment field in ORC files from web clickstreams)? Secondly, how do you track who is reaching those data sets, and how frequently sensitive data is being accessed, joined, or copied in your Hadoop deployment?
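To make the first question concrete, here is a minimal sketch of how a scanner might spot credit card PANs hiding in a freeform field such as a user comment. It is only an illustration and not how Dataguise discovery or any Hortonworks component actually works: the names PAN_CANDIDATE, luhn_valid, and find_pans are made up for this example, and it assumes PANs are 13 to 19 digits long, optionally separated by single spaces or hyphens. It pairs a regular expression with the Luhn checksum to cut down on false positives from order numbers and other digit runs.

```python
import re
from typing import List

# Hypothetical pattern: 13 to 19 digits, optionally separated by single
# spaces or hyphens, and not embedded in a longer run of digits.
PAN_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pans(text: str) -> List[str]:
    """Return Luhn-valid PAN candidates found in freeform text, digits only."""
    hits = []
    for match in PAN_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test card number, not a real account.
    comment = "customer said card 4111-1111-1111-1111 was declined"
    print(find_pans(comment))
```

A real discovery pass would also need to decode Avro or ORC records, handle names and addresses (which have no checksum to lean on), and run at cluster scale, which is where the murkiness comes from.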
What’s Down There?
The Hortonworks Data Governance Initiative (DGI) is chartered to provide improved visibility, monitoring, and sensitive-data auditing for enterprises that keep privacy, purchasing, credit card, or health information in their enterprise data lakes. To enhance DGI, Hortonworks and Dataguise have partnered to bring new security and visibility to data lakes: to illuminate your data, even in the dark regions, and to attach Apache Ranger’s scalable access and authorization controls so that this data is accessed securely.
Dataguise discovery (inspecting all files in all repositories to accurately assess security risk) is actively being coupled with Apache Knox and Apache Ranger to determine who can access sensitive data, and to build new monitoring capabilities that alert in real time on access to sensitive data that may indicate security risks or data governance violations.
* In fact, Dataguise discovery has been deployed on Hortonworks HDP at leading auto insurance companies and online web platforms specifically to catch sensitive driver data (PII) and credit card data (PCI) of the kinds mentioned above.
See for Yourself
We will be highlighting some specific use cases for securing enterprise data lakes in an upcoming Hortonworks Data Governance Initiative (DGI) webinar on August 20th. The webinar will examine these discovery, authorization, and monitoring solutions across real-world examples in healthcare, insurance, technology, and online retail data lakes. Details for that webinar are coming soon!